Statistical Progamming and Modelling with Montseratt Meteorological Data¶

Curiosty Project¶

Researcher and CUNY City Tech Graduated Alumni: Le' Sean Roberts
Contact:

  • Email: lesean.roberts85@gmail.com
  • Phone: (718) 559-7671 or (868) 383-9658

Date: [24/06/2025]

Abstract¶

This project concerns exploration of the applications of data wrangling, exploratory data analysis (EDA) techniques, statistical analysis, stochastic models and machine learning methods on historical weather data to extract valuable insights and patterns. The advantages of daily and hourly meteorological data will be leveraged to apply the prior mentioned tools and subjects.

Introduction¶

Meteorological data, with its inherent variability and complex patterns, presents an ideal playground for the application of statistical and stochastic applications. These mathematical tools allows one to delve into the intricacies of weather phenomena, uncovering hidden trends, making accurate predictions, and gaining deeper insights into climate systems.

Statistical analysis forms the foundation for comprehending meteorological data. Primitively, descriptive statistics can be employed, being the view of key characteristics such as mean, median, mode, standard deviation (and variance). Such measures provide a quantitative overview of the data's central tendency and dispersion. Also, distribution investigation is possible with tools such as measures of skewness and kurtosis, along with histograms and quantile-quantile plots. For the case of continuous attributes, the use of Pearson correlation can provide insights on possible linearity associations between attributes.

Inferential statistics takes the analysis further by drawing conclusions about a population (or period of weather data) based on a sample. Hypothesis testing allows for assessing the significance of observed differences or relationships. For example, there's ability to test whether the average temperature in a particular region has increased over the past century, or whether there's significant difference between two periods. Prior development in statistical analysis (descriptive statistics, skewness, kurtosis, and others) to be a premature guide on the types of inferential statistics tools to be applied. Data is to be studied, and not to be making loose assumptions upon it.

While statistical methods provide valuable insights, they often fall short in capturing inherent randomness and temporal dependence present in meteorological data. Stochastic processes, on the other hand, are mathematical models that describe the evolution of a system over time., incorporating deterministic and random components.

One of the commonly applied stochastic processes in meteorology is the autoregressive (AR) model. This model assumes that the current value of a variable is a linear function of its past values, along with a random error term. AR models are particularly useful for forecasting time series data. In general, processes that extend the AR model may be included in this project. Statistical and stochastic processes have numerous application in meteorology, including:

  1. Climate Modeling: These techniques are essential for developing complex climate models that simulate the Earth's climate system and predict future climate change.

  2. Climate Change Detection and Attribution: Statistical techniques are applied to detect trends in climate data and attribute these trends to human activities or natural variability.

  3. Extreme Event Analysis: These methods help analyze extreme weather events, such as hurricanes, floods, and heatwaves, and assess their potential impacts.

  4. Weather Forecasting: Statistical and stochastic models are used to improve the accuracy of weather forecasts, especially for short-term predictions.

This project assimilates meteorological data focused on Montserrat, an island in the Caribbean of the Lesser Antilles. Units for such meteorological data is expressed in the metric system; measures of length or depth are in millimeters however, and km/h for the case of windspeed. The time format is of ISO 8601.

Repositories from various government agencies such as the National Oceanic and Atmospheric Administration (NOAA), National Weather Service (NWS), and National Centers for Environmental Information (NCEI), together with revered resources such as Kaggle and Open-Meteo API, proved to be valuable resources for this project.

None of the prior mentioned endeavors are possible without competent data assimilation and wrangling processes and practices. On many occasions data wrangling can be applied towards various interests in modeling and exploratory data analysis. The more knowledge and vigor one has, such results in the need for dedication and unorthodox pursuits, being essential features for revolutionary development w.r.t. to time and resources. Data assimilation and data wrangling can be quite a challenge w.r.t. the types of programming languages applied. This project leverages the Python programming language, requiring much more technical development and patience than a conventional statistical development language such as R. This project is heavier on the programming side than on the mathematical side because there's emphasis on getting substantial development, rather than clique ideologies and luxurious "espieglerie". On many occasions throughout the process, various errors or concerns are encountered due to the natural being of applied data files; cases of mixed data types, bad entries, missing data, instances appearing as a particular data type but being something else; values incompatible with particular models for convergence, conventions triggering warnings, deprecated customs, and so forth.

Furthermore, this project also leverages various types of machine learning methods, such as supervised learning (common regression, and classification), unsupervised learning (histogram-based outlier score and local outlier factor), and ensemble learning (random forests and XGBoost). Such a course can be a respectable substitute for numerical weather prediction (NWP) models when computational complexity is not desired, and time is limited for research development. However, large data sets are generally required, which may at times sully the prestige of machine learning models compared to NWP models. Poor models are natural; decent or good/great models require much ingenuity sans creating underlying bias or fraudulence.

Methodology¶

A conventional idea about statistical programming or data science, the following steps are statutory:

  1. Data Collection and Cleaning

  2. Exploratory Data Analysis

  3. Feature Engineering

  4. Model Development

  5. Validation and Testing

  6. Dashboard Development

However, in real practice there may be a back-and-forth process among various steps depending on what pursuits there are. Classification, building classes or feature engineering can arise on multiple occasions to develop robust models. This project often implements such unconventional activity.

Declaration¶

This project is designed with the intent to provide readers with the tools and structure necessary to critically investigate and assess its methodology. It does not aim to create superficial or overly simplified outcomes, but rather fosters a genuine opportunity for analysis and constructive critique. By utilizing the Python programming language, the project enables progressive and continuous development, empowering a global community of researchers, regardless of socioeconomic background or time constraints.

While the project is an elementary step, it represents a foundational development crucial for advancing serious data analysis work. It is important to recognize that data analysis cannot be confined to classroom norms and customs, as not everything can be fully taught in an academic setting. This project assumes that readers already possess a strong foundational knowledge of basic data analysis, as it is not intended as a teaching tool. Instead, it is developed with the expectation that serious, real-world applications in commerce and development are the ultimate goals. My intention is not to teach, as I do not have the time or resources to engage in teaching roles. This project is a step toward advancing the field and should be viewed as such.

Daily Meteorological Data¶

Daily meteorological/climate data, a collection of measurements taken over a 24-hour period, provides a vital foundation for understanding climate trends and predicting future atmospheric conditions. This data encompasses variables such as temperature, precipitation, wind speed and direction, atmospheric pressure, etc.

The collection of daily meteorological data relies on a network of weather stations equipped with specialized instruments. Thermometers measure temperature, rain gauges collect precipitation, anemometers gauge wind speed and direction, barometers measure atmospheric pressure, and so forth. This data is then processed and analyzed by meteorologists to extract meaningful insights.

The applications of daily meteorological data are diverse and far-reaching. In the realm of weather forecasting, this data serves as the cornerstone for predicting future weather patterns, enabling individuals and organizations to plan activities and make informed decisions. Climate studies rely on long-term analysis of daily meteorological data to identify trends, understand climate variability, and assess the impacts of climate change. In agriculture, daily meteorological data plays a crucial role in optimizing planting and harvesting schedules, irrigation practices, and pest control strategies. Additionally, the energy sector relies on daily meteorological data to forecast energy demand, ensuring efficient grid management and resource allocation. Meteorological data, collected at various temporal resolutions, provides invaluable insights into weather patterns. While hourly data offers detailed information about short-term weather events, daily data is often more suitable for analyzing long-term climate trends. This distinction arises from several key factors.

Firstly, the sheer volume of hourly data can be overwhelming, particularly when dealing with extensive datasets spanning multiple years. With hourly data the frequency of observations can obscure underlying patterns and make it difficult to identify significant trends. In contrast, daily data, aggregated from hourly observations, reduces the noise, and allows for a more focused analysis of long-term climate signals.

Secondly, daily data often incorporates averaging or smoothing techniques, which can help to mitigate the impact of short-term weather fluctuations. These techniques reduce the variability of the data, making it easier to discern underlying trends and patterns. Hourly data, on the other hand, may be more susceptible to the influence of transient weather events, such as thunderstorms or brief temperature spikes, which can obscure long-term climate signals.

Thirdly, climate signals, including seasonal variations, long-term warming or cooling trends, and decadal oscillations, are typically more pronounced at the daily timescale. Hourly data can be more sensitive to short-term weather phenomena, which may mask these larger-scale patterns. By focusing on daily averages, researchers can better isolate and analyze the long-term climate signals embedded within the data. Moreover, many statistical methods used in climate analysis, such as correlation analysis and regression modeling, are better suited to daily data due to its reduced variability and the potential for more robust statistical relationships. These methods can help to identify meaningful connections between climate variables and underlying drivers, providing valuable insights into climate dynamics.

Finally, daily data is generally less computationally intensive to store, and process compared to hourly data, which can be particularly important for large datasets spanning multiple decades. This efficiency allows researchers to work with larger datasets and conduct more complex analyses. While hourly meteorological data is essential for understanding short-term weather events, daily data offers a more effective lens for examining long-term climate trends and variability. By reducing noise, focusing on broader patterns, and facilitating statistical analysis, daily data provides a valuable resource for climate scientists and researchers seeking to understand the Earth’s climate system. Daily meteorological or climate data is an invaluable resource that underpins our understanding of the Earth’s atmosphere and its complex systems. By collecting, processing, and analyzing this data, we gain valuable insights into weather patterns, climate trends, and the impacts of environmental factors. This information is essential for informed decision-making, sustainable development, and the well-being of our planet.

The daily data applied stems from the Open-Meteo API. The data ranges from year 1990 to year 2025.

Data Wrangling: Data Assimilation, Data Frames and Cleaning¶

Assimilating data from sources such as repositories, databases and APIs is common practice today, requiring basic scripts to retrieve data with respect to unique parameters.

Data cleaning generally concerns identifying and correcting errors, inconsistencies, or missing values. This may involve tasks such as removing duplicates, imputing missing data, etc.

In [8]:
import openmeteo_requests

import pandas as pd
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
	"latitude": 16.7425,
	"longitude": -62.1874,
	"start_date": "1980-01-08",
	"end_date": "2025-06-24",
	"daily": ["temperature_2m_mean", "temperature_2m_max", "temperature_2m_min", "apparent_temperature_mean", "apparent_temperature_max", "apparent_temperature_min", "wind_speed_10m_max", "et0_fao_evapotranspiration", "rain_sum", "dew_point_2m_max", "dew_point_2m_min", "surface_pressure_max", "surface_pressure_min", "pressure_msl_max", "pressure_msl_min", "relative_humidity_2m_max", "relative_humidity_2m_min", "wet_bulb_temperature_2m_max", "wet_bulb_temperature_2m_min", "vapour_pressure_deficit_max", "soil_temperature_0_to_7cm_mean"],
	"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process daily data. The order of variables needs to be the same as requested.
daily = response.Daily()
daily_temperature_2m_mean = daily.Variables(0).ValuesAsNumpy()
daily_temperature_2m_max = daily.Variables(1).ValuesAsNumpy()
daily_temperature_2m_min = daily.Variables(2).ValuesAsNumpy()
daily_apparent_temperature_mean = daily.Variables(3).ValuesAsNumpy()
daily_apparent_temperature_max = daily.Variables(4).ValuesAsNumpy()
daily_apparent_temperature_min = daily.Variables(5).ValuesAsNumpy()
daily_wind_speed_10m_max = daily.Variables(6).ValuesAsNumpy()
daily_et0_fao_evapotranspiration = daily.Variables(7).ValuesAsNumpy()
daily_rain_sum = daily.Variables(8).ValuesAsNumpy()
daily_dew_point_2m_max = daily.Variables(9).ValuesAsNumpy()
daily_dew_point_2m_min = daily.Variables(10).ValuesAsNumpy()
daily_surface_pressure_max = daily.Variables(11).ValuesAsNumpy()
daily_surface_pressure_min = daily.Variables(12).ValuesAsNumpy()
daily_pressure_msl_max = daily.Variables(13).ValuesAsNumpy()
daily_pressure_msl_min = daily.Variables(14).ValuesAsNumpy()
daily_relative_humidity_2m_max = daily.Variables(15).ValuesAsNumpy()
daily_relative_humidity_2m_min = daily.Variables(16).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_max = daily.Variables(17).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_min = daily.Variables(18).ValuesAsNumpy()
daily_vapour_pressure_deficit_max = daily.Variables(19).ValuesAsNumpy()
daily_soil_temperature_0_to_7cm_mean = daily.Variables(20).ValuesAsNumpy()

daily_data = {"date": pd.date_range(
	start = pd.to_datetime(daily.Time(), unit = "s", utc = True),
	end = pd.to_datetime(daily.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = daily.Interval()),
	inclusive = "left"
)}

daily_data["temperature_2m_mean"] = daily_temperature_2m_mean
daily_data["temperature_2m_max"] = daily_temperature_2m_max
daily_data["temperature_2m_min"] = daily_temperature_2m_min
daily_data["apparent_temperature_mean"] = daily_apparent_temperature_mean
daily_data["apparent_temperature_max"] = daily_apparent_temperature_max
daily_data["apparent_temperature_min"] = daily_apparent_temperature_min
daily_data["wind_speed_10m_max"] = daily_wind_speed_10m_max
daily_data["et0_fao_evapotranspiration"] = daily_et0_fao_evapotranspiration
daily_data["rain_sum"] = daily_rain_sum
daily_data["dew_point_2m_max"] = daily_dew_point_2m_max
daily_data["dew_point_2m_min"] = daily_dew_point_2m_min
daily_data["surface_pressure_max"] = daily_surface_pressure_max
daily_data["surface_pressure_min"] = daily_surface_pressure_min
daily_data["pressure_msl_max"] = daily_pressure_msl_max
daily_data["pressure_msl_min"] = daily_pressure_msl_min
daily_data["relative_humidity_2m_max"] = daily_relative_humidity_2m_max
daily_data["relative_humidity_2m_min"] = daily_relative_humidity_2m_min
daily_data["wet_bulb_temperature_2m_max"] = daily_wet_bulb_temperature_2m_max
daily_data["wet_bulb_temperature_2m_min"] = daily_wet_bulb_temperature_2m_min
daily_data["vapour_pressure_deficit_max"] = daily_vapour_pressure_deficit_max
daily_data["soil_temperature_0_to_7cm_mean"] = daily_soil_temperature_0_to_7cm_mean

daily_dataframe = pd.DataFrame(data = daily_data)
print(daily_dataframe)
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
                           date  temperature_2m_mean  temperature_2m_max  \
0     1980-01-08 04:00:00+00:00            23.374834           24.141499   
1     1980-01-09 04:00:00+00:00            23.264421           23.891499   
2     1980-01-10 04:00:00+00:00            22.322748           23.191502   
3     1980-01-11 04:00:00+00:00            22.587332           23.341499   
4     1980-01-12 04:00:00+00:00            21.306086           22.091499   
...                         ...                  ...                 ...   
16600 2025-06-20 04:00:00+00:00            25.351082           26.199001   
16601 2025-06-21 04:00:00+00:00            25.390665           25.898998   
16602 2025-06-22 04:00:00+00:00            25.317749           25.898998   
16603 2025-06-23 04:00:00+00:00                  NaN           25.848999   
16604 2025-06-24 04:00:00+00:00                  NaN                 NaN   

       temperature_2m_min  apparent_temperature_mean  \
0               22.191502                  22.092840   
1               22.191502                  22.358231   
2               21.341499                  21.067259   
3               21.841499                  19.905577   
4               20.541500                  19.145449   
...                   ...                        ...   
16600           24.848999                  25.104864   
16601           24.699001                  25.419016   
16602           24.449001                  24.848602   
16603           25.098999                        NaN   
16604                 NaN                        NaN   

       apparent_temperature_max  apparent_temperature_min  wind_speed_10m_max  \
0                     23.520189                 20.983297           37.212578   
1                     23.697132                 21.602598           36.896046   
2                     22.371422                 19.988932           35.654541   
3                     20.436180                 18.984425           42.072281   
4                     19.637054                 18.262983           40.104061   
...                         ...                       ...                 ...   
16600                 27.231419                 23.766788           40.882591   
16601                 27.573139                 24.278919           38.166790   
16602                 26.219694                 23.004978           44.039349   
16603                 25.357843                 23.626095           42.990990   
16604                       NaN                       NaN                 NaN   

       et0_fao_evapotranspiration  rain_sum  ...  surface_pressure_max  \
0                        3.982460       1.5  ...            983.794922   
1                        3.946293       0.8  ...            984.397400   
2                        3.259691       2.7  ...            983.913513   
3                        4.604709       0.5  ...            983.572449   
4                        2.766571       5.7  ...            982.082092   
...                           ...       ...  ...                   ...   
16600                    4.981394       0.1  ...            983.506775   
16601                    5.119689       0.0  ...            983.344971   
16602                    5.130907       1.0  ...            982.319397   
16603                         NaN       NaN  ...            981.898865   
16604                         NaN       NaN  ...                   NaN   

       surface_pressure_min  pressure_msl_max  pressure_msl_min  \
0                980.577454       1019.299988       1016.099976   
1                981.443359       1019.900024       1016.900024   
2                980.805786       1019.599976       1016.299988   
3                980.355164       1019.099976       1015.900024   
4                978.976501       1017.799988       1014.599976   
...                     ...               ...               ...   
16600            981.255981       1018.700012       1016.500000   
16601            980.240479       1018.700012       1015.400024   
16602            979.411743       1017.500000       1014.500000   
16603            979.643860       1017.099976       1014.799988   
16604                   NaN               NaN               NaN   

       relative_humidity_2m_max  relative_humidity_2m_min  \
0                     87.652779                 70.725937   
1                     87.906815                 73.156029   
2                     90.619431                 71.578697   
3                     81.800613                 61.149487   
4                     89.427284                 78.321884   
...                         ...                       ...   
16600                 86.541199                 70.866669   
16601                 85.219734                 72.591751   
16602                 86.767601                 72.591751   
16603                 84.229759                 75.320984   
16604                       NaN                       NaN   

       wet_bulb_temperature_2m_max  wet_bulb_temperature_2m_min  \
0                        21.027277                    20.169138   
1                        20.914402                    20.337797   
2                        20.636232                    18.998484   
3                        19.724335                    17.843048   
4                        19.959215                    19.202456   
...                            ...                          ...   
16600                    23.118631                    21.683819   
16601                    22.751518                    22.099451   
16602                    22.906918                    21.904879   
16603                    23.149427                    22.411777   
16604                          NaN                          NaN   

       vapour_pressure_deficit_max  soil_temperature_0_to_7cm_mean  
0                         0.880710                       24.816500  
1                         0.795568                       24.729010  
2                         0.783625                       24.678999  
3                         1.107534                       24.629000  
4                         0.576288                       24.578997  
...                            ...                             ...  
16600                     0.984500                       26.217749  
16601                     0.912614                       26.238586  
16602                     0.912614                       26.267754  
16603                     0.821694                             NaN  
16604                          NaN                             NaN  

[16605 rows x 22 columns]

Daily Meteorological Attributes¶

  1. temperature_2m_mean °C: Mean daily air temperature at 2 meters above ground.

  2. temperature_2m_max and temperature_2m_min °C: Maximum and minimum daily air temperature at 2 meters above ground.

  3. apparent_temperature_max and apparent_temperature_min °C: Mean, Maximum and minimum daily apparent temperature.

  4. rain_sum (mm): Sum of daily rain

  5. wind_speed_10m_max and wind_gusts_10m_max (km/h (mph, m/s, knots)): Maximum wind speed and gusts on a day.

  6. et0_fao_evapotranspiration (mm): Daily sum of ETO Reference Evapotranspiration of a well-watered grass field per day in Megajoules.

  7. surface_pressure_max and surface_pressure_min (hPa): Surface pressure

  8. pressure_msl_max and pressure_msl_min (hPa): Maximum and minimum atmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology.

  9. relative_humidity_2m_max and relative_humidity_2m_min (%): Maximum and minimum relative humidity at 2 meters above ground.

  10. wet_bulb_temperature_2m_max and wet_bulb_temperature_2m_min (°C): maximum and minimum lowest temperature that can be reached by evaporating water into the air at a constant pressure.

  11. vapour_pressure_deficit (kPa): Maximum Vapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases.

  12. soil_temperature_0_to_7cm_mean (°C): Mean average temperature of different soil levels below ground.

  13. dew_point_2m_max and dew_point_2m_min (°C): Maximum and minimum dew point temperature at 2 meters above ground.

In [10]:
# Dropping missing values
# Observing attribute data properties
daily_dataframe_clean = daily_dataframe.dropna()
daily_dataframe_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 16603 entries, 0 to 16602
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype              
---  ------                          --------------  -----              
 0   date                            16603 non-null  datetime64[ns, UTC]
 1   temperature_2m_mean             16603 non-null  float32            
 2   temperature_2m_max              16603 non-null  float32            
 3   temperature_2m_min              16603 non-null  float32            
 4   apparent_temperature_mean       16603 non-null  float32            
 5   apparent_temperature_max        16603 non-null  float32            
 6   apparent_temperature_min        16603 non-null  float32            
 7   wind_speed_10m_max              16603 non-null  float32            
 8   et0_fao_evapotranspiration      16603 non-null  float32            
 9   rain_sum                        16603 non-null  float32            
 10  dew_point_2m_max                16603 non-null  float32            
 11  dew_point_2m_min                16603 non-null  float32            
 12  surface_pressure_max            16603 non-null  float32            
 13  surface_pressure_min            16603 non-null  float32            
 14  pressure_msl_max                16603 non-null  float32            
 15  pressure_msl_min                16603 non-null  float32            
 16  relative_humidity_2m_max        16603 non-null  float32            
 17  relative_humidity_2m_min        16603 non-null  float32            
 18  wet_bulb_temperature_2m_max     16603 non-null  float32            
 19  wet_bulb_temperature_2m_min     16603 non-null  float32            
 20  vapour_pressure_deficit_max     16603 non-null  float32            
 21  soil_temperature_0_to_7cm_mean  16603 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(21)
memory usage: 1.6 MB
In [11]:
daily_dataframe_clean.isna().sum()
Out[11]:
date                              0
temperature_2m_mean               0
temperature_2m_max                0
temperature_2m_min                0
apparent_temperature_mean         0
apparent_temperature_max          0
apparent_temperature_min          0
wind_speed_10m_max                0
et0_fao_evapotranspiration        0
rain_sum                          0
dew_point_2m_max                  0
dew_point_2m_min                  0
surface_pressure_max              0
surface_pressure_min              0
pressure_msl_max                  0
pressure_msl_min                  0
relative_humidity_2m_max          0
relative_humidity_2m_min          0
wet_bulb_temperature_2m_max       0
wet_bulb_temperature_2m_min       0
vapour_pressure_deficit_max       0
soil_temperature_0_to_7cm_mean    0
dtype: int64

Summary Statistics¶

Summary statistics (also called descriptive statistics) are a set of numbers that describe the central tendency, spread, and shape of your data. They serve to comprehend the key features of your data quickly.

     Measures of Central Tendency: These tell you where the center of your data lies
            Mean: The average of all values.
            Median: The middle value when data is sorted. 
            Mode: The most frequently occurring value. Highly meaningful for categorical or ordinal data, but generally not the case for continuous attributes since the spread of such data is generally large without case groupings. 
     Measures of Spread: These tell you where the center of your data lies
            Range: The difference between the highest and lowest values. 
            Interquartile Range (IQR): The range of the middle 50% of the data. 
            Variance: The average squared difference from the mean. 
            Standard Deviation: The square root of the variance, giving a measure of spread in the same units as the data.
     Measures of Shape: These tell you where the center of your data lies
            Skewness: Measures how symmetric your data is. 
                 Positive skew: tail on the right
                 Negative skew: tail on the left
                 The baseline distribution in many (but not all) cases is normal distribution. A skewness value of 0 conveys symmetry. Realistic data doesn’t have such exact value, but may come close to it if a high level of symmetry exists. 
            Kurtosis: Measures how peaked or flat your data is. 
                 High kurtosis: very peaked
                 Low kurtosis: very flat
                 The baseline distribution in many (but not all) cases is normal distribution. A kurtosis value of 3 conveys the data to be normal. Realistic data doesn’t have such exact value, but may come close to it if a high level of normality exists.

Summary statistics are essential tools in the data analyst's toolkit, providing a concise overview of the key characteristics of a dataset. By calculating numerical measures that describe the central tendency, dispersion, and shape of the data, summary statistics help analysts quickly understand the data's underlying patterns and make informed decisions.

Measures of central tendency, such as the mean, median, and mode, provide information about the typical or representative value of the data. The mean represents the average value, the median indicates the middle value when the data is sorted, and the mode identifies the most frequently occurring value. These statistics help analysts understand the central location of the data and identify any potential biases or skewness.

Measures of dispersion, such as the range, variance, and standard deviation, quantify the spread or variability of the data. The range indicates the overall spread of the data, while the variance and standard deviation measure the average squared deviation from the mean. These statistics help analysts understand how much the data points vary from the central tendency and identify outliers or unusual values.

Measures of shape, such as skewness and kurtosis, provide insights into the overall distribution of the data. Skewness measures the asymmetry of the distribution, indicating whether the tail on one side is longer than the other. Kurtosis measures the peakedness or flatness of the distribution, revealing whether the data has heavy tails or a sharp peak.

Summary statistics are invaluable for understanding the basic properties of a dataset and for making informed decisions. They can be used to identify outliers, compare different groups of data, and assess the overall distribution of the data. By effectively using summary statistics, analysts can gain valuable insights into their data and make data-driven decisions with confidence.

In [13]:
# Drop the first column, since 'date' or datetime format isn't meaningful for summary statistics
daily_data_sans_first_col = daily_dataframe_clean.iloc[:, 1:]
daily_summary_stats = daily_data_sans_first_col.describe()
print(daily_summary_stats)
       temperature_2m_mean  temperature_2m_max  temperature_2m_min  \
count         16603.000000        16603.000000        16603.000000   
mean             24.273718           25.039673           23.379768   
std               1.146305            1.359918            1.125049   
min              20.558165           21.441502           18.841499   
25%              23.376921           23.991501           22.541500   
50%              24.374832           25.141499           23.441502   
75%              25.106709           25.841499           24.191502   
max              27.792749           29.699001           27.199001   

       apparent_temperature_mean  apparent_temperature_max  \
count               16603.000000              16603.000000   
mean                   24.702103                 26.499727   
std                     2.199432                  2.520320   
min                    17.356228                 18.977034   
25%                    23.038151                 24.679927   
50%                    24.931297                 26.687723   
75%                    26.306145                 28.271816   
max                    32.091747                 34.299778   

       apparent_temperature_min  wind_speed_10m_max  \
count              16603.000000        16603.000000   
mean                  23.352652           30.715977   
std                    2.104526            6.617004   
min                   15.962875            6.792466   
25%                   21.796345           26.649727   
50%                   23.566624           31.035257   
75%                   24.888628           35.068369   
max                   31.027546           93.806084   

       et0_fao_evapotranspiration      rain_sum  dew_point_2m_max  ...  \
count                16603.000000  16603.000000      16603.000000  ...   
mean                     4.468894      2.086117         20.885281  ...   
std                      0.739465      4.837135          1.480140  ...   
min                      1.299162      0.000000         13.591500  ...   
25%                      3.995539      0.100000         19.799000  ...   
50%                      4.517640      0.800000         21.191502  ...   
75%                      4.963439      2.100000         22.091499  ...   
max                      7.167190    151.499985         24.199001  ...   

       surface_pressure_max  surface_pressure_min  pressure_msl_max  \
count          16603.000000          16603.000000      16603.000000   
mean             981.201477            978.281189       1016.515198   
std                1.829245              1.903635          1.925053   
min              969.477356            956.304626       1004.400024   
25%              980.028717            977.115845       1015.299988   
50%              981.344055            978.492126       1016.700012   
75%              982.463745            979.609558       1017.799988   
max              986.992188            983.832581       1022.500000   

       pressure_msl_min  relative_humidity_2m_max  relative_humidity_2m_min  \
count      16603.000000              16603.000000              16603.000000   
mean        1013.511536                 84.635811                 72.861374   
std            1.997123                  4.958839                  6.816984   
min          990.700012                 56.597607                 35.666039   
25%         1012.299988                 81.713306                 69.922222   
50%         1013.700012                 85.474670                 74.697205   
75%         1014.900024                 88.505695                 77.612133   
max         1019.200012                 96.143120                 87.194046   

       wet_bulb_temperature_2m_max  wet_bulb_temperature_2m_min  \
count                 16603.000000                 16603.000000   
mean                     21.826445                    20.951443   
std                       1.289275                     1.404822   
min                      16.469954                    15.306818   
25%                      20.785082                    19.948974   
50%                      22.077534                    21.231716   
75%                      22.850239                    22.076393   
max                      24.967825                    24.356148   

       vapour_pressure_deficit_max  soil_temperature_0_to_7cm_mean  
count                 16603.000000                    16603.000000  
mean                      0.863740                       26.025419  
std                       0.251260                        1.748632  
min                       0.366002                       22.646914  
25%                       0.700161                       24.816500  
50%                       0.791879                       25.829008  
75%                       0.945766                       26.641506  
max                       2.260390                       35.667747  

[8 rows x 21 columns]

Skew and Kurtosis¶

Skew and kurtosis are two statistical measures that provide valuable insights into the shape and characteristics of a probability distribution. While the mean and standard deviation offer central tendency and dispersion, skew and kurtosis delve into the asymmetry and peakedness of a dataset, respectively.

Skew measures the asymmetry of a distribution. A positive skew indicates that the tail to the right (larger values) is longer or heavier than the tail to the left. Conversely, a negative skew suggests that the tail to the left (smaller values) is longer. A zero skew implies a symmetric distribution. Skew is often visualized as a distortion of the normal distribution curve, with the peak shifted to one side and the tail extended in the opposite direction.

Kurtosis measures the peakedness or flatness of a distribution relative to a normal distribution. A high kurtosis, also known as leptokurtosis, indicates a distribution with heavy tails and a sharp peak. This means that there is a higher probability of extreme values occurring. In contrast, a low kurtosis, or platykurtosis, suggests a distribution with light tails and a flat peak, implying a lower likelihood of extreme events. A mesokurtic distribution has a kurtosis similar to a normal distribution.

Understanding skew and kurtosis is essential for data analysis and interpretation. For instance, a positively skewed distribution might suggest that there are a few very large values that are pulling the mean to the right, while a negatively skewed distribution could indicate the presence of a few very small values. Kurtosis can help identify outliers or unusual patterns in a dataset.

In [15]:
import scipy.stats as stats
#Skew and kurtosis
skewness = daily_data_sans_first_col.skew()
kurtosis = daily_data_sans_first_col.kurtosis()
print("Skewness:")
print(skewness)
print("\nKurtosis:")
print(kurtosis)
Skewness:
temperature_2m_mean              -0.080265
temperature_2m_max                0.287729
temperature_2m_min               -0.120366
apparent_temperature_mean        -0.181386
apparent_temperature_max         -0.097145
apparent_temperature_min         -0.182777
wind_speed_10m_max               -0.031380
et0_fao_evapotranspiration       -0.260995
rain_sum                          9.741498
dew_point_2m_max                 -0.675294
dew_point_2m_min                 -0.967386
surface_pressure_max             -0.437890
surface_pressure_min             -0.847299
pressure_msl_max                 -0.411078
pressure_msl_min                 -0.827830
relative_humidity_2m_max         -0.977159
relative_humidity_2m_min         -1.182832
wet_bulb_temperature_2m_max      -0.426486
wet_bulb_temperature_2m_min      -0.623472
vapour_pressure_deficit_max       1.489722
soil_temperature_0_to_7cm_mean    1.471327
dtype: float32

Kurtosis:
temperature_2m_mean                -0.575250
temperature_2m_max                 -0.045151
temperature_2m_min                 -0.356732
apparent_temperature_mean          -0.512940
apparent_temperature_max           -0.415506
apparent_temperature_min           -0.437406
wind_speed_10m_max                  1.502347
et0_fao_evapotranspiration          0.357547
rain_sum                          166.088089
dew_point_2m_max                    0.136888
dew_point_2m_min                    0.824719
surface_pressure_max                0.598648
surface_pressure_min                3.302744
pressure_msl_max                    0.517645
pressure_msl_min                    3.165608
relative_humidity_2m_max            1.211777
relative_humidity_2m_min            1.198043
wet_bulb_temperature_2m_max        -0.528162
wet_bulb_temperature_2m_min        -0.190336
vapour_pressure_deficit_max         2.257868
soil_temperature_0_to_7cm_mean      3.004556
dtype: float32

Histograms and Quantile-Quantile Plots¶

Now, to provide a visual display of the distributions to acquire and apply visual judgement. Histograms provide pictures or the shapes of the distributions, while Q-Q plots provide view of the divergence or disaparity from (in our case) normal distribution.

NOTE: the benchmark or ideal distribution doesn't have to be normal.

In [17]:
import matplotlib.pyplot as plt
import seaborn as sns

# Get the column names
column_names = daily_data_sans_first_col.columns
print(column_names)
column_names_list = column_names.tolist()

# Calculating the number of ros and columns for subplots.
num_cols = 3  # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
     # Calculating the number of rows

# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))

# Flatten if required.
if num_rows > 1:
  axes = axes.flatten()

# Plot the histograms
for i, col in enumerate(column_names_list):
  sns.histplot(data = daily_data_sans_first_col[col], ax = axes[i], kde = True)
  axes[i].set_title(f'Histogram of {col}')
  axes[i].set_xlabel('Value')
  axes[i].set_ylabel('Frequency')
  axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min',
       'apparent_temperature_mean', 'apparent_temperature_max',
       'apparent_temperature_min', 'wind_speed_10m_max',
       'et0_fao_evapotranspiration', 'rain_sum', 'dew_point_2m_max',
       'dew_point_2m_min', 'surface_pressure_max', 'surface_pressure_min',
       'pressure_msl_max', 'pressure_msl_min', 'relative_humidity_2m_max',
       'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max',
       'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max',
       'soil_temperature_0_to_7cm_mean'],
      dtype='object')
No description has been provided for this image
In [18]:
# Get the column names
column_names_no_date = daily_data_sans_first_col.columns
print(column_names_no_date)
column_names_list_no_date = column_names_no_date.tolist()


# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))

# Flatten if required.
if num_rows > 1:
  axes = axes.flatten()

# Plot the histograms
for i, col in enumerate(column_names_list_no_date):
  ax = axes[i]
  stats.probplot(daily_data_sans_first_col[col], dist = "norm", plot = ax)
  ax.set_title(f'QQ Plot of {col}')
  ax.grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min',
       'apparent_temperature_mean', 'apparent_temperature_max',
       'apparent_temperature_min', 'wind_speed_10m_max',
       'et0_fao_evapotranspiration', 'rain_sum', 'dew_point_2m_max',
       'dew_point_2m_min', 'surface_pressure_max', 'surface_pressure_min',
       'pressure_msl_max', 'pressure_msl_min', 'relative_humidity_2m_max',
       'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max',
       'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max',
       'soil_temperature_0_to_7cm_mean'],
      dtype='object')
No description has been provided for this image

Scatterplots Encompassing all Physical Variables¶

Observation of scatter plots is a customary preliminary method of model determination involving the observed variables.

In [20]:
# Creating PairGrid with three columns
g = sns.PairGrid(daily_data_sans_first_col)
# Mapping scatterplots
g.map(sns.scatterplot)

# Adjusting columns specification
plt.subplots_adjust(left = 0.1, right = 0.9, top = 0.9, bottom = 0.1, wspace = 0.3, hspace = 0.3)

# Showing plot
plt.show()
No description has been provided for this image

Such prior scatter plots provide detail on whether linearity exists among variable pairs. Scatter plotting is a preliminary investigation to determine whether predictive models like (multi)linear regression will be practical. Some scatter plots fall into a linear-type orientation due to observed general trend in data orientation. Else, there are scatter plots whose data are highly condensed or clustered with general shapes.

NOTE: the "perfectly" linear scatter plots in the main diagonal are to be ignored because such cases are variables plotted against themselves, which isn't meaningful.

Correlation and Correlation Heatmaps¶

Correlation measures the strength and direction of the linear relationship between two variables. In other words, it quantifies how well a change in one variable can be associated with a corresponding change in another, based on a straight-line relationship.

Correlation refers to the statistical relationship or association between two variables. When two variables are correlated, changes in one variable tend to be accompanied by changes in the other variable.

Correlation is typically measured using a correlation coefficient, which quantifies the strength and direction of the relationship between the variables. The Pearson correlation coefficient ranges from -1 to 1:

A correlation coefficient of 1 indicates a perfect positive correlation, meaning that as one variable increases, the other variable also increases proportionally.

A correlation coefficient of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other variable decreases proportionally.

A correlation coefficient of 0 indicates no correlation, meaning that there is no systematic relationship between the variables.

As well, for the Pearson measure a high correlation (regardless of sign) value conveys a possible linear relationship between the variables being compared.

Correlation does not imply causation, meaning that even if two variables are correlated, it does not necessarily mean that changes in one variable cause changes in the other variable. Correlation simply quantifies the degree to which two variables vary together. A crude but effective example, "the number of firefighters in operations service corresponds to the number of hazardous fires occuring....however, more firefighters don't cause more fires."

The Pearson correlation coefficient ($r$) is the most common measure of linearity. It ranges from -1 to +1:

$r = +1$: Perfect positive linear relationship. All points lie on an upward-sloping straight line.

$r = -1$: Perfect negative linear relationship. All points lie on a downward-sloping straight line.

$r = 0$: No linear relationship; the data points do not form a recognizable line.

In [24]:
# Applying Pearson correlation to the data set
daily_pearson_corr = daily_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(daily_pearson_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
No description has been provided for this image

The correlation heat map appears to be consistent with the scatter plot matrix from earlier. Pearson correlation conveys levels of association among attributes and/or level of possible linearity. For highly correlated data pairs (> 0.8) the scatter plots will conform to a line with low instantaneous rate of change (with positive or negative slope); lower correlation, say, 0.8 to 0.4 in magnitude would appear elliptical; a correlation of 0 would appear circular, or have some irregular shape, or have clusters, or at least two axes that can well span the data dispersion (where one direction isn't dominant over the other). NOTE: one should not assume that natural real attributes must have linear relationships.

Time Series¶

The prior correlation heatmap conveys very low association between time/date and the physical variables. If the time or date data is not highly correlated with the meteorological data, such indicates that the temporal aspect of the data does not show a strong linear relationship with those meteorological variables. Yet, there should not be the assumption that general data pairs are naturally linear.

A time series is a set of data points that occur in successive order over a period of time. The data applied is reflective of such.

Time series analysis concerns observing possible residing trend, seasonality or cycle properties. The time series data is decomposed to uniquely identify the possible existence of such characteristics.

A time series is a sequence of data points collected over time. Mathematically, it can be represented as:

$$Y(t)=T(t)+S(t)+\epsilon(t)$$

$Y(t)$: The observed value of the time series at time t.

$T(t)$: The trend component, representing the long-term direction of the series.

$S(t)$: The seasonal component, representing periodic fluctuations within a fixed time period.

$\epsilon(t)$: The residual or noise component, representing the random fluctuations that cannot be explained by the trend or seasonal components.

Trend Component The trend component can be modeled using various functions, such as:

Linear: $T(t) = \alpha\,t + \beta$

Polynomial: $T(t) = \alpha_0 + \alpha_1(t) + \alpha_2(t^2) + ... + \alpha_n(t^n)$

Exponential: $T(t) = \alpha\,\text {e}^{\beta(t)}$

Logistic: $T(t) = \frac{\alpha}{1+ \beta\,\text {e}^{(-\gamma(t))}}$

Seasonal Component The seasonal component can be modeled using periodic functions, such as:

Sine/Cosine: $S(t) = \alpha\,\text {sin}(\omega\,t + \phi) + \beta\,\text {cos}(\omega,t + \phi)$

Fourier series: $S(t) = a_0 + Σ[a\,n\,cos(2πnt/T) + b\,n\,sin(2πnt/T)]$

Residual Component The residual component is often assumed to be white noise, meaning it has:

Zero mean: $E[\epsilon(t)] = 0$

Constant variance: $Var[\epsilon(t)] = \sigma^2$

No autocorrelation: $\text {Cov}[\epsilon(t), epsilon(s)] = 0\,\,\,\, \text{for t} \neq s$

Stationarity A time series is said to be stationary if its statistical properties (mean, variance, autocorrelation) remain constant over time. Stationarity is a common assumption in many time series models.

Time Series with LOWESS Smoothing¶

LOWESS (Locally Weighted Scatterplot Smoothing) is a non-parametric regression technique used to smooth data in time series or scatterplots. It is particularly useful for capturing trends in data without assuming a specific functional form, making it ideal for exploratory data analysis.

Local Regression: LOWESS performs a series of localized linear regressions across the data. For each point in the dataset, it fits a weighted linear regression using a subset of nearby data points.

Weighted Fitting: Points closer to the target point (in terms of x-values) are given more weight in the local regression. The weight decreases as the distance between the target point and neighboring points increases, often using a tricube weighting function.

Smoothing Parameter (frac): This controls the span or bandwidth of the smoothing window --

  1. A small frac (close to 0) uses fewer neighboring points for each local fit, resulting in a curve that closely follows the data (less smoothing).

  2. A large frac (closer to 1) uses more points for each local fit, producing a smoother curve that captures broader trends but may miss finer details.

Flexible Smoothing: Unlike parametric models that assume a specific relationship (e.g., linear, quadratic), LOWESS adapts to the data. It is especially useful when the true relationship between variables is unknown or non-linear.

Handles Non-Linear Trends: LOWESS can reveal complex patterns, such as oscillations or sudden shifts in time series data, that linear models cannot easily capture.

Local Behavior: Since LOWESS is local to each point, it can adapt to different patterns in different parts of the dataset, making it more flexible than global smoothing methods like polynomial fitting.

No Assumptions About Distribution: As a non-parametric method, LOWESS doesn’t require assumptions about the underlying distribution of the data (e.g., normality), making it a robust choice for noisy or irregular data.

When plotting time series data, raw data may contain a lot of noise, making it difficult to identify general trends. LOWESS helps to:

  1. Smooth Out Short-Term Fluctuations: It filters out high-frequency noise, leaving a clearer picture of long-term trends.

  2. Identify Underlying Patterns: It can reveal the shape and nature of the trend, even in the presence of noisy or irregular data.

A Local Polynomial Regression¶

For each point $t$ in the time series, LOWESS performs a local regression using a subset of the data. Such local polynomial regression can be expressed mathematically as:

$$\hat{Y}(t) = \sum_{j=1}^{n} W_{j}(t) \cdot Y(t_{j}) \cdot \frac{(t_{j} - t)^{p}}{h^{p}}$$

Where $Y(t_j)$ is the value of the time series at some point $j$.

$W_j(t)$ is a weight assigned to the the observation $Y(t_j)$ based on its distance from $t$.

$h$ is the bandwidth parameter that determines the size of the local neighbourhood.

$p$ is the degree of the local polynomial (commonly 1 or 2).

Weight Function

The weights $W_j(t)$ are computed using a weight function, commonly a tricube weight function defined as:

$$ W_{j}(t) = \begin{cases} (1 - |d|^3)^3 & \text{if } |d| < 1 \\ 0 & \text{if } |d| \geq 1 \end{cases} $$

where $d = \frac{|t_{j} - t|}{h}$.

Iterative Fitting

LOWESS can also be implemented in an iterative manner, refining the fit by iterating through the residuals and re-weighting the observations based on their distance from the local fit.

In [31]:
import statsmodels.api as sm

# Create a new DataFrame meteo_data_numeric to avoid modifying the original meteo_data
daily_data_numeric = daily_dataframe_clean.copy()

# Remove rows with invalid 'date' values in the new DataFrame
daily_data_numeric = daily_data_numeric.dropna(subset=['date'])

# Convert 'date' to Unix timestamps in the new DataFrame
daily_data_numeric['date_numeric'] = daily_data_numeric['date'].apply(lambda x: x.timestamp())

# Setting the style of the seaborn plots
sns.set_style('whitegrid')

# Defining the variables to plot against "date_numeric"
variables_to_plot = daily_data_numeric.columns.drop(['date', 'date_numeric']).tolist()

# Define a color palette
color_palette = sns.color_palette("husl", len(variables_to_plot))

# Plotting each variable against 'date_numeric' with smoothing in the new DataFrame
for variable, color in zip(variables_to_plot, color_palette):
    plt.figure(figsize=(12, 6))

    # Plot the original data using Seaborn lineplot
    sns.lineplot(data=daily_data_numeric, x='date_numeric', y=variable, color=color, label='Original')

    # Apply LOWESS smoothing
    smoothed = sm.nonparametric.lowess(daily_data_numeric[variable], daily_data_numeric['date_numeric'], frac=0.1)

    # Plot the smoothed line
    plt.plot(daily_data_numeric['date_numeric'], [point[1] for point in smoothed], color='red', linestyle='--', label='Smoothed')

    # Add title, labels, and formatting
    plt.title(f'Time Series Plot of {variable} with Smoothing', fontsize=14)
    plt.xlabel('Date (Unix Timestamp)')
    plt.ylabel(variable)
    plt.xticks(rotation=45)  # Rotating x-axis labels for better readability
    plt.legend()
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Augmented Dickey- Fuller (ADF) Test¶

From the Statsmodels package, stationarity conveys that the statistical properties of a time series, say, the mean, variance and covariance do not vary over time. Many statistical models require the series to be stationary to make effective and precise predictions. Two statistical tests would be used to check the stationarity of a time series - Augmented Dickey Fuller (“ADF”) Test and Kwiatkowski-Phillips-Schmidt-Shin (“KPSS”) Test.

Critical values are thresholds that determine whether the test statistic obtained from the ADF test is significant or not. The ADF test is commonly used to assess the stationarity of a time series data.

Here's how critical values work in this context:

Unit Root: A series with a unit root is a non-stationary time series, namely, it possesses a changining variance over time. Such a property makes the time series difficult toi analyze and model.

ADF Test Statistic: The ADF test calculates a test statistic based on the degree of non-stationarity in the time series data. This test statistic is compared against critical values to determine whether the data is stationary or non-stationary.

Null Hypothesis: The null hypothesis of the ADF test is that the time series data has a unit root, indicating it is non-stationary. The alternative hypothesis is that the data is stationary.

Critical Values: Critical values are pre-defined thresholds derived from statistical distributions, such as the Dickey-Fuller distribution. These critical values correspond to different levels of significance (e.g., 1%, 5%, 10%). They represent the values beyond which the ADF test statistic must exceed for the null hypothesis to be rejected.

If the ADF test statistic is more negative than the critical values, it provides evidence against the null hypothesis, suggesting stationarity in the data.

Conversely, if the ADF test statistic is less negative than the critical values, there's insufficient evidence to reject the null hypothesis, indicating non-stationarity in the data.

Interpretation: Typically, if the absolute value of the ADF test statistic is less negative than the critical values at a chosen significance level (e.g., 5%), then we fail to reject the null hypothesis, implying that the time series data is non-stationary. Conversely, if the absolute value of the test statistic is more negative than the critical values, we reject the null hypothesis, indicating stationarity in the data.

The ADF test is used to determine the presence of unit root in the series, and hence helps in understand if the series is stationary or not. The null and alternate hypothesis of this test are:

NULL HYPOTHESIS: The series has a unit root.

ALTERNATE HYPOTHESIS: The series has no unit root.

If the null hypothesis is failed to be rejected, this test may provide evidence that the series is non-stationary.

Autoregressive models are statistical models used for time series analysis, where present values are predicted based on a linear combination of past values. Such models assume that past behavior influences future outcomes, making them meaningful for forecasting trends and patterns in data over time (Fernando 2024). Test} ullr (ADF) test model is given by the following equa$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \epsilon_t$$

Where:

$ y_t $ is the time series being tested,

$ \Delta y_t = y_t - y_{t-1} $ is the first difference of the time series,

$ t $ is the time trend (optional),

$ \alpha $ is a constant (drift term),

$ \beta t $ represents the deterministic time trend (optional),

$ \gamma $ is the coefficient for testing the presence of a unit root,

$ \delta_i $ are the coefficients for the lagged difference terms,

$ p $ is the number of lags of the differenced terms,

$ \epsilon_t $ is the white noise error term.

The hypotheses for the ADF test are as follows:

$ H_0: \gamma = 0 $ The series has a unit root, i.e., it is non-stationary;

$ H_A: \gamma < 0 $ The series is stationary

The test statistic is calculated using the $ t $-statistic of the estimated $ \gamma $:

$ \tau = \frac{\hat{\gamma}}{SE(\hat{\gamma})} $

Where:

$ \hat{\gamma} $ is the estimated coefficient for $ y_{t-1} $,

$ SE(\hat{\gamma}) $ is the standard error of $ \hat{\gamma} $.

If the test statistic $ \tau $ is more negative than the critical value, we reject the null hypothesis and conclude that the series is stationary.

If $ \tau $ is less negative than the critical value, we fail to reject the null hypothesis, implying the series has a unit root and is non-stationary.

In [34]:
from statsmodels.tsa.stattools import adfuller

# Initialize an empty list to store columns with p-values greater than 0.05
columns_with_high_p_values = []

# Loop through each column in the DataFrame
for column in daily_dataframe_clean.columns:
    # Check if the column is constant
    if daily_dataframe_clean[column].nunique() == 1:
        print(f"Column '{column}' is constant and will be skipped.")
        continue

    # Performing ADF test on the current column.
    result = adfuller(daily_dataframe_clean[column].dropna())

    # Extracting ADF test results for the current column
    print(f"ADF Test Results for '{column}':")
    print(f"ADF Statistic: {result[0]}")
    print(f"p-value: {result[1]}")
    print(f"Critical Values: {result[4]}")
    print("\n")

    # Check if the p-value is greater than 0.05 (non-stationary)
    if result[1] > 0.05:
        columns_with_high_p_values.append(column)

# Create a data frame containing columns with p-values greater than 0.05 (non-stationary)
non_stationary_columns_df = daily_dataframe_clean[columns_with_high_p_values]
print("Columns with p-values greater than 0.05 (non-stationary):")
print(non_stationary_columns_df.head())
ADF Test Results for 'date':
ADF Statistic: 127.521177222858
p-value: 1.0
Critical Values: {'1%': -3.4307447795924704, '5%': -2.8617144767135985, '10%': -2.5668628695330438}


ADF Test Results for 'temperature_2m_mean':
ADF Statistic: -9.452164604817089
p-value: 4.585375734432722e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'temperature_2m_max':
ADF Statistic: -8.140604784014066
p-value: 1.030640432208805e-12
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'temperature_2m_min':
ADF Statistic: -8.81654710030056
p-value: 1.9274119348856e-14
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'apparent_temperature_mean':
ADF Statistic: -9.54169935641269
p-value: 2.715772031935339e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'apparent_temperature_max':
ADF Statistic: -9.331043783206056
p-value: 9.327040813917995e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'apparent_temperature_min':
ADF Statistic: -9.388649864148997
p-value: 6.652675031865217e-16
Critical Values: {'1%': -3.430744970348345, '5%': -2.861714561014421, '10%': -2.5668629144052084}


ADF Test Results for 'wind_speed_10m_max':
ADF Statistic: -19.007591367188702
p-value: 0.0
Critical Values: {'1%': -3.4307444938038882, '5%': -2.861714350414922, '10%': -2.566862802306002}


ADF Test Results for 'et0_fao_evapotranspiration':
ADF Statistic: -8.943862977329935
p-value: 9.099297004054091e-15
Critical Values: {'1%': -3.4307447795924704, '5%': -2.8617144767135985, '10%': -2.5668628695330438}


ADF Test Results for 'rain_sum':
ADF Statistic: -18.165064292851632
p-value: 2.4554926496016038e-30
Critical Values: {'1%': -3.4307445890207626, '5%': -2.8617143924941595, '10%': -2.5668628247041987}


ADF Test Results for 'dew_point_2m_max':
ADF Statistic: -9.386741780827915
p-value: 6.7275145125415855e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'dew_point_2m_min':
ADF Statistic: -9.155005530487388
p-value: 2.6242011536920892e-15
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'surface_pressure_max':
ADF Statistic: -12.997314804753536
p-value: 2.732954497172046e-24
Critical Values: {'1%': -3.430744565212235, '5%': -2.861714381972446, '10%': -2.566862819103636}


ADF Test Results for 'surface_pressure_min':
ADF Statistic: -13.306442710596356
p-value: 6.88255466457685e-25
Critical Values: {'1%': -3.4307445176037983, '5%': -2.8617143609328277, '10%': -2.566862807904538}


ADF Test Results for 'pressure_msl_max':
ADF Statistic: -14.03454018893963
p-value: 3.387816589160927e-26
Critical Values: {'1%': -3.4307443986329553, '5%': -2.861714308355986, '10%': -2.566862779918612}


ADF Test Results for 'pressure_msl_min':
ADF Statistic: -12.990868429724106
p-value: 2.8143830251545577e-24
Critical Values: {'1%': -3.4307445176037983, '5%': -2.8617143609328277, '10%': -2.566862807904538}


ADF Test Results for 'relative_humidity_2m_max':
ADF Statistic: -10.04243834718041
p-value: 1.4840671835685065e-17
Critical Values: {'1%': -3.430744922642096, '5%': -2.861714539931579, '10%': -2.5668629031831025}


ADF Test Results for 'relative_humidity_2m_min':
ADF Statistic: -6.335038407097828
p-value: 2.844371020791204e-08
Critical Values: {'1%': -3.430744898793293, '5%': -2.861714529392068, '10%': -2.566862897573066}


ADF Test Results for 'wet_bulb_temperature_2m_max':
ADF Statistic: -10.193553778660885
p-value: 6.230936312873574e-18
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'wet_bulb_temperature_2m_min':
ADF Statistic: -9.75580665328518
p-value: 7.795623908912306e-17
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}


ADF Test Results for 'vapour_pressure_deficit_max':
ADF Statistic: -4.754890939104802
p-value: 6.628746720590535e-05
Critical Values: {'1%': -3.43074494649378, '5%': -2.8617145504723633, '10%': -2.5668629087938166}


ADF Test Results for 'soil_temperature_0_to_7cm_mean':
ADF Statistic: -6.294926686120316
p-value: 3.524903027117927e-08
Critical Values: {'1%': -3.430744970348345, '5%': -2.861714561014421, '10%': -2.5668629144052084}


Columns with p-values greater than 0.05 (non-stationary):
                       date
0 1980-01-08 04:00:00+00:00
1 1980-01-09 04:00:00+00:00
2 1980-01-10 04:00:00+00:00
3 1980-01-11 04:00:00+00:00
4 1980-01-12 04:00:00+00:00

From the above results, excluding the 'date' index, there are no attributes with non-stationarity. Hence, co-integration analysis among attribute pairs are not possible. Co-integration concerns observing the long-term trend between two variables, to identify any possible similar behaviour. For the case of lacking non-stationarity, one needs to revert to measures like correlation.

Long-Term Forecasting¶

When it comes to long-term forecasting, there are several approaches and techniques one can apply that doesn't explicitly require the data to be stationary. Some examples:

  1. Facebook's Prophet: This is a powerful forecasting tool that can handle missing data and outliers. It works well with daily data and captures seasonality without needing to transform the data to be stationary.
  2. Multiple Linear Regression: You can use regression techniques to forecast future values based on one or more predictor variables without the need for stationarity. This approach works well when you have external factors that influence your target variable.

Forecasting with Prophet¶

Prophet is an open-source forecasting tool developed by Facebook, designed specifically for making forecasts with time series data.

KEY FEATURES OF PROPHET:

  1. Automatic Seasonal Adjustment:

Prophet automatically detects and accounts for yearly, weekly, and daily seasonal effects in the data. This is especially useful for datasets that show clear periodic trends.

  1. Flexible Trend Modeling:

Prophet can model trends that change over time, including linear and logistic growth models. This allows it to adapt to both consistent growth and more complex trend behaviors.

  1. Handling of Missing Data:

Prophet is robust to missing data points and can perform well even if some timestamps are missing.

  1. User-Friendly:

Designed to be easy to use for both novices and experienced data scientists, it requires minimal preprocessing of the data.

  1. Outlier Detection:

The model can identify and handle outliers, which can significantly impact forecast accuracy.

  1. Incorporation of Holidays:

Users can include custom holidays and special events, allowing the model to account for effects that might not be captured by the seasonal trends alone.

  1. Scalability:

Prophet is efficient for large datasets and can quickly fit models and generate forecasts.

MATHEMATICAL STRUCTURE OF PROPHET

Prophet decomposes a time series into three main components:

Trend Component:

Represents the long-term progression of the series.

Can be modeled as a piecewise linear or logistic growth curve. The algorithm automatically detects changes in the trend (change points).

$$g(t)=\text {piecewise linear or logistic growth function}$$

Seasonal Component:

Captures periodic fluctuations in the data, which can occur yearly, weekly, or daily.

Seasonal effects are modeled using Fourier series. The number of Fourier terms can be adjusted for each seasonality.

$$s(t) = \sum_{n=1}^{N} \left[ a_n \cos\left( \frac{2 \pi n t}{T} \right) + b_n \sin\left( \frac{2 \pi n t}{T} \right) \right]$$

Holiday Effects:

Incorporates the effects of holidays that can cause significant changes in the time series.

The holiday effect can be treated as an additional regressor in the model.

$$h(t) = \sum_{i=1}^{H} \delta_i \cdot I(t \in \text{holiday}_i)$$

$H$ is the number of holidays.

$\delta_i$ is the effect of holiday $i$.

$I(t \in \text{holiday}_i)$ is an indicator function that is 1 if $t$ falls on holiday $i$.

Overall model represented by:

$$y(t)=g(t)+s(t)+h(t)+\epsilon_t$$

$\epsilon_t$ is the error term, assumed to be normally distributed.

In [38]:
from prophet import Prophet

# Ensure 'date' is not set as the index
if 'date' not in daily_dataframe_clean.columns:
    meteo_data.reset_index(inplace=True)

# List of variables to forecast (excluding 'date')
variables_to_forecast = daily_dataframe_clean.columns.drop('date')

# Create a function to fit the model and make predictions
def forecast_variable(variable):
    # Prepare data
    forecast_data = daily_dataframe_clean[['date', variable]].rename(columns={'date': 'ds', variable: 'y'})

    # Remove timezone information if present
    forecast_data['ds'] = forecast_data['ds'].dt.tz_localize(None)  # Remove timezone

    # Initialize the Prophet model
    model = Prophet()

    # Fit the model
    model.fit(forecast_data)

    # Make future predictions for the next 365 days
    future = model.make_future_dataframe(periods=365)
    forecast = model.predict(future)

    # Plot the forecast
    fig = model.plot(forecast)
    plt.title(f'Long-term Forecast for {variable}')
    plt.xlabel('Date')
    plt.ylabel(variable)
    plt.show()

    return forecast

# Loop through each variable and forecast
forecasts = {}
for variable in variables_to_forecast:
    forecasts[variable] = forecast_variable(variable)
22:28:54 - cmdstanpy - INFO - Chain [1] start processing
22:29:14 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:29:19 - cmdstanpy - INFO - Chain [1] start processing
22:29:35 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:29:40 - cmdstanpy - INFO - Chain [1] start processing
22:29:54 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:29:58 - cmdstanpy - INFO - Chain [1] start processing
22:30:13 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:30:17 - cmdstanpy - INFO - Chain [1] start processing
22:30:30 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:30:35 - cmdstanpy - INFO - Chain [1] start processing
22:30:50 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:30:55 - cmdstanpy - INFO - Chain [1] start processing
22:31:01 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:31:06 - cmdstanpy - INFO - Chain [1] start processing
22:31:11 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:31:16 - cmdstanpy - INFO - Chain [1] start processing
22:31:21 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:31:28 - cmdstanpy - INFO - Chain [1] start processing
22:31:43 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:31:47 - cmdstanpy - INFO - Chain [1] start processing
22:32:01 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:32:06 - cmdstanpy - INFO - Chain [1] start processing
22:32:18 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:32:23 - cmdstanpy - INFO - Chain [1] start processing
22:32:33 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:32:38 - cmdstanpy - INFO - Chain [1] start processing
22:32:50 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:32:55 - cmdstanpy - INFO - Chain [1] start processing
22:33:04 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:33:12 - cmdstanpy - INFO - Chain [1] start processing
22:33:20 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:33:24 - cmdstanpy - INFO - Chain [1] start processing
22:33:36 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:33:42 - cmdstanpy - INFO - Chain [1] start processing
22:33:56 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:34:01 - cmdstanpy - INFO - Chain [1] start processing
22:34:14 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:34:19 - cmdstanpy - INFO - Chain [1] start processing
22:34:39 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
22:34:46 - cmdstanpy - INFO - Chain [1] start processing
22:35:02 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image

Multilinear Regression¶

Such is a statistical method that explores the relationship between a dependent variable (target) and two or more independent variables (features or predictors) by fitting a linear equation to observed data. It aims to understand how changes in the independent variables collectively influence the dependent variable.

The values that multiply the predictors in the regression equation. These coefficients indicate the strength and direction of the relationship between each predictor and the target, holding other variables constant.

Multiple linear regression finds the best-fitting linear equation that minimizes the differences between the observed values of the dependent variable and the values predicted by the equation. This is typically done by minimizing the sum of squared errors between the predicted and actual values.

Recalling all non-time attributes to be float types, hence multilinear regression is applicable:

In [41]:
daily_data_sans_first_col.info()
<class 'pandas.core.frame.DataFrame'>
Index: 16603 entries, 0 to 16602
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   temperature_2m_mean             16603 non-null  float32
 1   temperature_2m_max              16603 non-null  float32
 2   temperature_2m_min              16603 non-null  float32
 3   apparent_temperature_mean       16603 non-null  float32
 4   apparent_temperature_max        16603 non-null  float32
 5   apparent_temperature_min        16603 non-null  float32
 6   wind_speed_10m_max              16603 non-null  float32
 7   et0_fao_evapotranspiration      16603 non-null  float32
 8   rain_sum                        16603 non-null  float32
 9   dew_point_2m_max                16603 non-null  float32
 10  dew_point_2m_min                16603 non-null  float32
 11  surface_pressure_max            16603 non-null  float32
 12  surface_pressure_min            16603 non-null  float32
 13  pressure_msl_max                16603 non-null  float32
 14  pressure_msl_min                16603 non-null  float32
 15  relative_humidity_2m_max        16603 non-null  float32
 16  relative_humidity_2m_min        16603 non-null  float32
 17  wet_bulb_temperature_2m_max     16603 non-null  float32
 18  wet_bulb_temperature_2m_min     16603 non-null  float32
 19  vapour_pressure_deficit_max     16603 non-null  float32
 20  soil_temperature_0_to_7cm_mean  16603 non-null  float32
dtypes: float32(21)
memory usage: 1.5 MB

Recall the Pearson Correlation Heatmap:

In [43]:
# Applying Pearson correlation to the data set
daily_pearson_corr = daily_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(daily_pearson_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
No description has been provided for this image

Quantile Regression¶

One can use regression techniques to forecast future values based on one or more predictor variables without the need for stationarity. This approach works well when you have external factors that influence your target variable.

Firstly, observing the scatter plots and Pearson correlation heatmap there's evidence for high nonlinearity among variable pairs. Then again, linear regression is not the only type of regression.

Quantile Regression can effectively manage nonlinearity in relationships between variables. Unlike ordinary least squares (OLS) regression, which estimates the conditional mean of the response variable given certain predictor variables, quantile regression estimates the conditional quantiles (e.g., median, quartiles) of the response variable. This allows it to provide a more comprehensive view of the relationship between variables, particularly in the presence of nonlinearity.

Basic Quantile Regression Model:

The mathematical formulation of quantile regression (Koenker and Hallock 2001) can be defined as follows:

  1. Model Specification – quantile regression model for a given quantile $\tau$ where $(0 < \tau < 1)$ can be expressed in the manner of
$$Q_y(\tau\,|X)=X\beta(\tau)$$

$Q_y(\tau|X)$ is the $\tau$-quantile of the response variable (target) $y$ given the predictor variables (features) $X$.

$X$ is a vector of predictor variables.

$\beta(\tau)$ is the vector of coefficients associated with the quantile $\tau$.

  1. Objective Function – the estimation of the quantile regression coefficients $\beta(\tau)$ is done by minimizing the following loss function:
$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau(y_i - X_i \beta)$$

$n$ being the number of observation,

$\rho_{\tau}(u)$ as the quantile loss function, define by:

$$\rho_\tau(u) = \begin{cases} \tau u & \text{if } u \geq 0 \\(\tau - 1) u & \text{if } u < 0 \end{cases}$$

Such above function provides different penalties for positive and negative residuals based on the quantiles of interest.

Feature Selection¶

This process concerns identifying the features that are most influential upon the target of concern. In this project Random Forest regressor is applied. Random Forest is an ensemble learning method, commonly applying decision trees.

Random Forest Feature Selection:

There is the challenge to identify attributes (or features or predictors) that influence a target variable without (cognitive) bias. Feature selection techniques can be applied for determining the importance or relevance of features in predictive modeling.

The random forest (regressor) will be applied for feature selection. Firstly, Random Forest is a popular ensemble learning algorithm that combines the predictions of multiple decision trees to improve accuracy and reduce overfitting. It's a versatile method used for both classification and regression tasks. The algorithm creates multiple decision trees by randomly sampling the training data with replacement. This process is known as bootstrapping (in a bagging sense). Each tree is trained on a different subset of the data. For each decision node in a tree, a random subset of features is selected. This helps to prevent overfitting by reducing the correlation between trees. Once all trees are trained, their predictions are combined to make a final decision. For classification tasks, a majority vote is used. For regression tasks, the average of the predictions is taken.

One of the key strengths of Random Forest is its ability to reduce overfitting. By creating multiple decision trees and averaging their predictions, the algorithm effectively mitigates the risk of any individual tree becoming overly specialized to the training data. This ensemble approach helps to generalize the model and improve its performance on unseen data.

Moreover, Random Forest consistently outperforms individual decision trees, especially when dealing with complex datasets. The combination of multiple diverse models leads to a more accurate and robust prediction.

Another valuable aspect of Random Forest is its capability to assess feature importance. By analyzing the frequency with which features are selected in the decision trees, the algorithm can provide insights into which variables are most influential in the prediction process. This information is invaluable for understanding the underlying relationships and making informed decisions about feature selection or engineering.

Random Forest is also known for its robustness to noise and outliers. The ensemble nature of the algorithm helps to reduce the impact of individual noisy data points, making it more resilient to variations in the data.

Furthermore, Random Forest is highly scalable, capable of handling large datasets and high-dimensional feature spaces efficiently. This scalability makes it suitable for a wide range of applications. Such includes geophysics, bioological sciences, medical diagnosis, financial forecasting, and sports.

Visuals for Random Forest Regressor:

  1. Prediction Line Visualization:

Shows how the Random Forest fits the data, highlighting the average behavior of multiple decision trees.

  1. Tree Structure Visualization:

Displays the structure of an individual decision tree used in the regression to understand the splits.

  1. Feature Importance Plot:

Shows the importance of each feature in the Random Forest regression model.

An example visualization of how a Random Forest functions (for one and multiple features):

In [46]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree
import seaborn as sns

# Generate synthetic regression data
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # 100 data points, one feature
y = 2 * np.sin(X).ravel() + np.random.normal(0, 0.5, X.shape[0])

# Train Random Forest Regressor
regressor = RandomForestRegressor(n_estimators=15, random_state=42)
regressor.fit(X, y)

# 1. Prediction Line Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Data")
X_test = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred = regressor.predict(X_test)
plt.plot(X_test, y_pred, color="red", label="Random Forest Prediction")
plt.title('Random Forest Regressor - Prediction Line')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()

# Add space between plots
plt.subplots_adjust(hspace=0.4)

# 2. Tree Structure Visualization
plt.figure(figsize=(20, 10))
plot_tree(regressor.estimators_[0], filled=True, rounded=True, feature_names=['Feature 1'])
plt.title("Random Forest Regressor - Tree 1")
plt.show()

# Add space between plots
plt.subplots_adjust(hspace=0.4)

# 3. Feature Importance Plot
feature_importances = regressor.feature_importances_
features = ['Feature 1']
plt.figure(figsize=(8, 6))
sns.barplot(x=features, y=feature_importances)
plt.title('Feature Importances in Random Forest Regressor')
plt.show()
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
No description has been provided for this image
<Figure size 640x480 with 0 Axes>
No description has been provided for this image

Explanation of the Visuals

  1. Prediction Line Visualization:

This plot shows how the Random Forest Regressor fits the data by averaging the outputs of individual decision trees. The red line represents the predicted values, while the blue points are the actual data points.

  1. Tree Structure Visualization:

This uses the plot_tree function to visualize an individual tree from the Random Forest, showing the splits and conditions used for regression.

  1. Feature Importance Plot:

Displays the importance of each feature in the Random Forest model, indicating how much each feature contributes to the predictions.

For the above display, the feature importance value is extreme and in favor of the feature because the target or dependent variable is structured upon the feature. Concerning the scatter plot with the random forest prediction, the random forest prediction will "converge" to the orientation of the data due to the direct construction of the target from the feature.

Multiple Features:

In [273]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree
import seaborn as sns

# Generate synthetic regression data with multiple features
np.random.seed(42)
X = np.random.rand(100, 5) * 10  # 100 data points, five features
y = (
    2 * np.sin(X[:, 0]) +
    3 * np.cos(X[:, 1]) +
    1.5 * X[:, 2] +
    0.5 * X[:, 3] ** 2 +
    np.random.normal(0, 0.5, X.shape[0])  # Adding noise
)

# Train Random Forest Regressor
regressor = RandomForestRegressor(n_estimators=20, random_state=42)
regressor.fit(X, y)

# 1. Prediction Line Visualization (using the first feature)
plt.figure(figsize=(10, 6))
X_test = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred = regressor.predict(np.concatenate([X_test, np.zeros((500, 4))], axis=1))  # Keeping other features constant
plt.scatter(X[:, 0], y, color="blue", label="Data")
plt.plot(X_test, y_pred, color="red", label="Random Forest Prediction")
plt.title('Random Forest Regressor - Prediction Line (First Feature)')
plt.xlabel('Feature 1')
plt.ylabel('Target')
plt.legend()
plt.show()

# 2. Tree Structure Visualization (showing the first tree)
plt.figure(figsize=(20, 10))
plot_tree(regressor.estimators_[0], filled=True, rounded=True, feature_names=[f'Feature {i+1}' for i in range(X.shape[1])])
plt.title("Random Forest Regressor - Tree 1")
plt.show()

# 3. Feature Importance Plot
feature_importances = regressor.feature_importances_
features = [f'Feature {i+1}' for i in range(X.shape[1])]  # Generate feature names dynamically

plt.figure(figsize=(10, 6))
# Assign features to hue and set legend to False
sns.barplot(x=features, y=feature_importances, palette='viridis', hue=features)
plt.title('Feature Importances in Random Forest Regressor')
plt.ylabel('Importance Score')
plt.xlabel('Features')
plt.xticks(rotation=45)  # Rotate feature names for better readability
plt.legend([],[], frameon=False)  # Remove legend
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

As for a target based on multile features, observing the scatter plot, the prediction curve generally should not converge to the orientation of the single first feature because the other features have influence upon the determination of the target. The prediction line concerning the first feature conveys barely any relationship with it. Observing the model, the quadratic term has the dominant influence in the long run.

Now, going back to the real data to implement.

For each target variable (like rain_sum, dew_point_2m_min, etc.):

  1. Use Recursive Feature Elimination (RFE) with a Random Forest Regressor to select the top 5 most important features.

  2. Use those selected features to fit a quantile regression model (median regression, i.e., quantile = 0.5) to understand the effect of those features on the target.

In [47]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# List of target variables
targets = ['rain_sum', 'dew_point_2m_min',
           'dew_point_2m_max',
'et0_fao_evapotranspiration',
           'soil_temperature_0_to_7cm_mean',
           'wet_bulb_temperature_2m_min',
           'wet_bulb_temperature_2m_max',
           'soil_temperature_0_to_7cm_mean']

# Initialize a dictionary to store selected features for each target
selected_features_dict = {}

# Iterate over each target variable
for target in targets:
    # Separate independent and dependent variables
    X = daily_data_sans_first_col.drop(target, axis=1)  # Drop the target column from the features
    y = daily_data_sans_first_col[target]  # Set the target column
    
    # Initialize a RandomForestRegressor
    rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
    
    # Initialize RFE with the desired number of features
    rfe = RFE(estimator=rf, n_features_to_select=5)
    
    # Fit RFE
    rfe.fit(X, y)
    
    # Get the selected features for the current target
    selected_features = rfe.support_
    important_features = X.columns[selected_features].tolist()
    
    # Store the selected features in the dictionary
    selected_features_dict[target] = important_features
    
    # Print the selected features for the current target
    print(f"Selected Features with RFE for {target}: {important_features}")

# Now, for each target, perform quantile regression using the selected features
for target, selected_features in selected_features_dict.items():
    # Prepare the independent variables (selected features)
    X_selected = daily_data_sans_first_col[selected_features]
    
    # Add a constant (intercept term)
    X_selected = sm.add_constant(X_selected)
    
    # Dependent variable (response)
    y = daily_data_sans_first_col[target]
    
    # Fit the quantile regression model at the 0.5 quantile
    model = sm.QuantReg(y, X_selected)
    quantile_50 = model.fit(q=0.5)
    
    # Print the summary of the quantile regression for the current target
    print(f"Quantile Regression Summary for {target}:")
    print(quantile_50.summary())
    print("\n" + "="*80 + "\n")
Selected Features with RFE for rain_sum: ['wind_speed_10m_max', 'et0_fao_evapotranspiration', 'surface_pressure_min', 'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max']
Selected Features with RFE for dew_point_2m_min: ['rain_sum', 'relative_humidity_2m_min', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean']
Selected Features with RFE for dew_point_2m_max: ['temperature_2m_max', 'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max', 'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean']
Selected Features with RFE for et0_fao_evapotranspiration: ['temperature_2m_mean', 'apparent_temperature_max', 'wind_speed_10m_max', 'rain_sum', 'vapour_pressure_deficit_max']
Selected Features with RFE for soil_temperature_0_to_7cm_mean: ['temperature_2m_max', 'apparent_temperature_max', 'et0_fao_evapotranspiration', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max']
Selected Features with RFE for wet_bulb_temperature_2m_min: ['temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min', 'dew_point_2m_min', 'wet_bulb_temperature_2m_max']
Selected Features with RFE for wet_bulb_temperature_2m_max: ['temperature_2m_mean', 'temperature_2m_max', 'dew_point_2m_max', 'wet_bulb_temperature_2m_min', 'soil_temperature_0_to_7cm_mean']
Selected Features with RFE for soil_temperature_0_to_7cm_mean: ['temperature_2m_max', 'apparent_temperature_max', 'et0_fao_evapotranspiration', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max']
Quantile Regression Summary for rain_sum:
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:               rain_sum   Pseudo R-squared:               0.1378
Model:                       QuantReg   Bandwidth:                      0.2189
Method:                 Least Squares   Sparsity:                        2.360
Date:                Fri, 27 Jun 2025   No. Observations:                16603
Time:                        22:43:22   Df Residuals:                    16597
                                        Df Model:                            5
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                         123.4087      5.599     22.043      0.000     112.435     134.383
wind_speed_10m_max              0.0444      0.002     29.388      0.000       0.041       0.047
et0_fao_evapotranspiration     -0.5483      0.017    -32.035      0.000      -0.582      -0.515
surface_pressure_min           -0.1327      0.006    -23.109      0.000      -0.144      -0.121
relative_humidity_2m_max        0.0922      0.003     32.625      0.000       0.087       0.098
wet_bulb_temperature_2m_max     0.0358      0.009      3.853      0.000       0.018       0.054
===============================================================================================

The condition number is large, 6.01e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Quantile Regression Summary for dew_point_2m_min:
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:       dew_point_2m_min   Pseudo R-squared:               0.9204
Model:                       QuantReg   Bandwidth:                     0.01984
Method:                 Least Squares   Sparsity:                       0.2287
Date:                Fri, 27 Jun 2025   No. Observations:                16603
Time:                        22:43:22   Df Residuals:                    16597
                                        Df Model:                            5
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                            -18.5247      0.079   -235.023      0.000     -18.679     -18.370
rain_sum                           0.0047      0.000     24.119      0.000       0.004       0.005
relative_humidity_2m_min           0.2377      0.001    180.456      0.000       0.235       0.240
wet_bulb_temperature_2m_min        0.7980      0.002    444.523      0.000       0.794       0.802
vapour_pressure_deficit_max        3.9190      0.035    111.079      0.000       3.850       3.988
soil_temperature_0_to_7cm_mean     0.0241      0.001     17.985      0.000       0.021       0.027
==================================================================================================

The condition number is large, 7.78e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Quantile Regression Summary for dew_point_2m_max:
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:       dew_point_2m_max   Pseudo R-squared:               0.9058
Model:                       QuantReg   Bandwidth:                     0.02402
Method:                 Least Squares   Sparsity:                       0.3092
Date:                Fri, 27 Jun 2025   No. Observations:                16603
Time:                        22:43:22   Df Residuals:                    16597
                                        Df Model:                            5
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                             -4.1291      0.032   -128.618      0.000      -4.192      -4.066
temperature_2m_max                -0.1994      0.005    -39.337      0.000      -0.209      -0.189
relative_humidity_2m_max           0.0462      0.000    132.738      0.000       0.046       0.047
wet_bulb_temperature_2m_max        1.2237      0.004    291.276      0.000       1.215       1.232
vapour_pressure_deficit_max        0.4028      0.015     26.619      0.000       0.373       0.432
soil_temperature_0_to_7cm_mean    -0.0359      0.002    -20.659      0.000      -0.039      -0.033
==================================================================================================

The condition number is large, 2.56e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Quantile Regression Summary for et0_fao_evapotranspiration:
                             QuantReg Regression Results                              
======================================================================================
Dep. Variable:     et0_fao_evapotranspiration   Pseudo R-squared:               0.3770
Model:                               QuantReg   Bandwidth:                     0.08506
Method:                         Least Squares   Sparsity:                        1.080
Date:                        Fri, 27 Jun 2025   No. Observations:                16603
Time:                                22:43:22   Df Residuals:                    16597
                                                Df Model:                            5
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                           1.0758      0.109      9.836      0.000       0.861       1.290
temperature_2m_mean            -0.1625      0.010    -16.569      0.000      -0.182      -0.143
apparent_temperature_max        0.1827      0.005     36.266      0.000       0.173       0.193
wind_speed_10m_max              0.0474      0.001     44.400      0.000       0.045       0.049
rain_sum                       -0.1016      0.001   -110.055      0.000      -0.103      -0.100
vapour_pressure_deficit_max     1.4703      0.018     80.902      0.000       1.435       1.506
===============================================================================================

The condition number is large, 1.24e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Quantile Regression Summary for soil_temperature_0_to_7cm_mean:
                               QuantReg Regression Results                                
==========================================================================================
Dep. Variable:     soil_temperature_0_to_7cm_mean   Pseudo R-squared:               0.6308
Model:                                   QuantReg   Bandwidth:                     0.08111
Method:                             Least Squares   Sparsity:                        1.037
Date:                            Fri, 27 Jun 2025   No. Observations:                16603
Time:                                    22:43:22   Df Residuals:                    16597
                                                    Df Model:                            5
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                           3.2281      0.099     32.453      0.000       3.033       3.423
temperature_2m_max              0.9753      0.014     70.922      0.000       0.948       1.002
apparent_temperature_max        0.0731      0.003     23.266      0.000       0.067       0.079
et0_fao_evapotranspiration     -0.3908      0.007    -54.470      0.000      -0.405      -0.377
wet_bulb_temperature_2m_min    -0.1405      0.011    -12.357      0.000      -0.163      -0.118
vapour_pressure_deficit_max     1.2173      0.061     20.052      0.000       1.098       1.336
===============================================================================================

The condition number is large, 1.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Quantile Regression Summary for wet_bulb_temperature_2m_min:
                              QuantReg Regression Results                              
=======================================================================================
Dep. Variable:     wet_bulb_temperature_2m_min   Pseudo R-squared:               0.8959
Model:                                QuantReg   Bandwidth:                     0.02574
Method:                          Least Squares   Sparsity:                       0.3332
Date:                         Fri, 27 Jun 2025   No. Observations:                16603
Time:                                 22:43:23   Df Residuals:                    16597
                                                 Df Model:                            5
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
const                           0.5340      0.029     18.250      0.000       0.477       0.591
temperature_2m_mean             0.3632      0.009     41.830      0.000       0.346       0.380
temperature_2m_max             -0.0747      0.004    -17.406      0.000      -0.083      -0.066
temperature_2m_min              0.0433      0.004      9.748      0.000       0.035       0.052
dew_point_2m_min                0.5662      0.002    279.097      0.000       0.562       0.570
wet_bulb_temperature_2m_max     0.0638      0.004     15.458      0.000       0.056       0.072
===============================================================================================

The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Quantile Regression Summary for wet_bulb_temperature_2m_max:
                              QuantReg Regression Results                              
=======================================================================================
Dep. Variable:     wet_bulb_temperature_2m_max   Pseudo R-squared:               0.9010
Model:                                QuantReg   Bandwidth:                     0.02272
Method:                          Least Squares   Sparsity:                       0.2932
Date:                         Fri, 27 Jun 2025   No. Observations:                16603
Time:                                 22:43:23   Df Residuals:                    16597
                                                 Df Model:                            5
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                              0.3041      0.025     12.066      0.000       0.255       0.354
temperature_2m_mean                0.2068      0.004     51.538      0.000       0.199       0.215
temperature_2m_max                 0.0914      0.004     22.976      0.000       0.084       0.099
dew_point_2m_max                   0.6201      0.002    253.660      0.000       0.615       0.625
wet_bulb_temperature_2m_min        0.0606      0.003     20.632      0.000       0.055       0.066
soil_temperature_0_to_7cm_mean    -0.0002      0.002     -0.138      0.890      -0.004       0.003
==================================================================================================

The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

================================================================================

Interpretation of the Summary Statistics

  1. Number of Observations: value indicates that there are such number of data points used in this regression.

  2. Df Residuals: value represents the degrees of freedom for the residuals (observations - model parameters).

  3. Df Model: value indicates the number of predictors (excluding the constant).

  4. P-values: If the p-values (for the intercept, feature1, feature2,…, featureN) are less than 0.05, there’s indication that the coefficients are statistically significant at a 5% significance level.

  5. Confidence Intervals: The intervals for each coefficient (e.g., [0.025, 0.975]) provide a range within which we can be 95% confident that the true parameter value lies.

  6. Pseudo R-squared: for the value observed, convert to percentage. such measure gives indication that determined percentage accounts for the variability in the target that is explained by the model. Although it's not the same as R-squared in OLS regression, a higher value suggests a good fit

  7. In the context of quantile regression, one of the most commonly used pseudo R² metrics is Koenker and Machado’s pseudo R². This is a specific form of pseudo R² that was developed to assess the fit of quantile regression models.

Koenker and Machado's Pseudo R²: designed to assess how well the model explains the variability in the data, relative to a baseline (often the model that predicts the median).

The Koenker and Machado pseudo R² is defined as:

$$R^2 = 1 - \frac{V(\hat{\theta})}{V(\hat{\theta}_0)}$$

$V(\hat{\theta})$ is the sum of weighted absolute residuals for the fitted model.

$V(\hat{\theta}_0)$ is the sum of weighted absolute residuals for the baseline model (usually a model predicting the median).

Interpretation: It compares the sum of residuals of your model to the residuals of a simpler baseline model. If your model performs better than the baseline, the pseudo $R^2$ will be positive, and if it performs worse, it can be negative.

With daily meteorological/climate data forecasting the associated attributes for long-term predictions presents several challenges that make it less practical. Here are some key reasons:

  1. High Variability:

A respective attribute can be highly variable, influenced by many factors like weather patterns, geographical location, and seasonality. This variability makes it difficult to produce reliable long-term forecasts.

  1. Inherent Noise:

Daily weather data is often noisy, with short-term fluctuations that can overshadow longer-term trends. This noise can complicate the forecasting process, as models may struggle to discern meaningful patterns.

  1. Seasonal Patterns:

Attributes like rainfall tends to exhibit seasonal patterns (e.g., wet and dry seasons), which might require different modeling approaches for different times of the year. Long-term forecasts may fail to capture these nuances.

  1. External Influences:

Long-term rainfall trends can be influenced by climate change, urbanization, and other large-scale environmental changes. These factors may not be adequately represented in historical data used for forecasting.

  1. Data Limitations:

Daily (rain_sum) data may be limited in historical depth or spatial coverage, especially in areas with fewer weather stations. This can affect the quality of long-term forecasts.

  1. Non-stationarity:

Climate patterns and (rainfall) distributions can change over time due to climate change and other factors, leading to non-stationarity in the data. This poses challenges for traditional forecasting models, which often assume that historical patterns will persist.

  1. Forecasting Horizon:

Long-term forecasts (e.g., months or years ahead) may be more appropriate for aggregate measures (like monthly or annual rainfall) rather than daily values. The uncertainty associated with daily forecasts increases significantly over longer time horizons.

  1. Practical Applications:

Many practical applications (like agriculture or water resource management) might benefit more from monthly or seasonal rainfall forecasts rather than daily forecasts. Longer aggregation periods can provide more relevant information for decision-making.

Instead of using daily attribute measures, consider forecasting monthly or seasonal totals, which may better capture underlying trends and patterns while mitigating some of the issues mentioned above. This could improve the reliability and applicability of your forecasts in various contexts.

The majority of the identified daily meteorological variables don't capture atmospheric physics or atmospheric chemistry to generally (analytically) model precipitation. Such data set is best suited for time series analysis, modeling and forecasting long term to analyze drastic climate variations.

The prior summary statistics for each model convey strong multicollinearity issues. Again, reviewing the correlation heatmap matrix.

In [50]:
# Applying Pearson correlation to the data set
daily_pearson_corr = daily_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(daily_pearson_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
No description has been provided for this image

Observed above are some observations of high correlation among pairs; exclude the diagonal elements due to the obvious.

A potential resolution for such multicollinearity is to observe the feature importance/rank w.r.t. the target in question. After observing the level of importance or rank, to then observe the correlation heat map. For a high positive correlation pair (where the idea of "high" is subjective), to choose the feature of higher importance.

In [53]:
from sklearn.model_selection import train_test_split

# List of targets
targets = [
    'rain_sum', 'dew_point_2m_min', 'dew_point_2m_max',
    'et0_fao_evapotranspiration', 'soil_temperature_0_to_7cm_mean',
    'wet_bulb_temperature_2m_min', 'wet_bulb_temperature_2m_max'
]

# Loop through each target
for target in targets:
    print(f"\n{'='*60}\nAnalyzing Target: {target}\n{'='*60}")

    # Define features: drop current target from targets list + use all other columns
    possible_features = daily_data_sans_first_col.drop(columns=[target])

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        possible_features, 
        daily_data_sans_first_col[target], 
        test_size=0.2, 
        random_state=42
    )

    # Initialize model
    rf_model = RandomForestRegressor(n_estimators=50, random_state=42)

    # Fit model
    rf_model.fit(X_train, y_train)

    # Feature importances
    importances = rf_model.feature_importances_
    feature_importances = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)

    # Plot feature importances
    plt.figure(figsize=(12, 6))
    plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
    plt.xlabel('Importance')
    plt.title(f'Feature Importances for Target: {target}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

    # Print ranked features
    print("Ranked Features based on Importance:")
    print(feature_importances)

    # Recursive Feature Elimination
    rfe = RFE(estimator=rf_model, n_features_to_select=5)
    rfe.fit(X_train, y_train)
    selected_features = X_train.columns[rfe.support_]

    print("Selected Features by RFE:")
    print(selected_features.tolist())
============================================================
Analyzing Target: rain_sum
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                           Feature  Importance
7       et0_fao_evapotranspiration    0.489625
14        relative_humidity_2m_max    0.137958
6               wind_speed_10m_max    0.102763
15        relative_humidity_2m_min    0.027880
19  soil_temperature_0_to_7cm_mean    0.025270
5         apparent_temperature_min    0.023284
16     wet_bulb_temperature_2m_max    0.020404
8                 dew_point_2m_max    0.018505
13                pressure_msl_min    0.018271
1               temperature_2m_max    0.016599
18     vapour_pressure_deficit_max    0.016413
11            surface_pressure_min    0.015052
0              temperature_2m_mean    0.013708
2               temperature_2m_min    0.013610
9                 dew_point_2m_min    0.012642
4         apparent_temperature_max    0.011980
17     wet_bulb_temperature_2m_min    0.011395
3        apparent_temperature_mean    0.009613
12                pressure_msl_max    0.007716
10            surface_pressure_max    0.007313
Selected Features by RFE:
['wind_speed_10m_max', 'et0_fao_evapotranspiration', 'relative_humidity_2m_max', 'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max']

============================================================
Analyzing Target: dew_point_2m_min
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                           Feature  Importance
17     wet_bulb_temperature_2m_min    0.939701
15        relative_humidity_2m_min    0.049060
1               temperature_2m_max    0.002069
8                         rain_sum    0.001350
2               temperature_2m_min    0.001314
16     wet_bulb_temperature_2m_max    0.001238
19  soil_temperature_0_to_7cm_mean    0.000893
18     vapour_pressure_deficit_max    0.000656
14        relative_humidity_2m_max    0.000488
7       et0_fao_evapotranspiration    0.000439
0              temperature_2m_mean    0.000429
4         apparent_temperature_max    0.000416
9                 dew_point_2m_max    0.000410
5         apparent_temperature_min    0.000294
6               wind_speed_10m_max    0.000275
3        apparent_temperature_mean    0.000275
10            surface_pressure_max    0.000199
11            surface_pressure_min    0.000191
12                pressure_msl_max    0.000153
13                pressure_msl_min    0.000150
Selected Features by RFE:
['temperature_2m_max', 'temperature_2m_min', 'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min']

============================================================
Analyzing Target: dew_point_2m_max
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                           Feature  Importance
16     wet_bulb_temperature_2m_max    0.950799
14        relative_humidity_2m_max    0.030534
1               temperature_2m_max    0.005117
19  soil_temperature_0_to_7cm_mean    0.003065
18     vapour_pressure_deficit_max    0.002229
2               temperature_2m_min    0.001954
15        relative_humidity_2m_min    0.001047
17     wet_bulb_temperature_2m_min    0.000822
7       et0_fao_evapotranspiration    0.000605
0              temperature_2m_mean    0.000540
8                         rain_sum    0.000458
5         apparent_temperature_min    0.000438
9                 dew_point_2m_min    0.000422
4         apparent_temperature_max    0.000403
6               wind_speed_10m_max    0.000361
3        apparent_temperature_mean    0.000329
11            surface_pressure_min    0.000240
10            surface_pressure_max    0.000238
13                pressure_msl_min    0.000205
12                pressure_msl_max    0.000194
Selected Features by RFE:
['temperature_2m_max', 'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max', 'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean']

============================================================
Analyzing Target: et0_fao_evapotranspiration
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                           Feature  Importance
18     vapour_pressure_deficit_max    0.374105
7                         rain_sum    0.218456
4         apparent_temperature_max    0.063076
6               wind_speed_10m_max    0.049613
11            surface_pressure_min    0.047105
0              temperature_2m_mean    0.039581
19  soil_temperature_0_to_7cm_mean    0.034692
14        relative_humidity_2m_max    0.034592
15        relative_humidity_2m_min    0.019304
5         apparent_temperature_min    0.017646
2               temperature_2m_min    0.015994
3        apparent_temperature_mean    0.012797
1               temperature_2m_max    0.011430
8                 dew_point_2m_max    0.010334
10            surface_pressure_max    0.010178
16     wet_bulb_temperature_2m_max    0.009248
17     wet_bulb_temperature_2m_min    0.008927
13                pressure_msl_min    0.008315
9                 dew_point_2m_min    0.007757
12                pressure_msl_max    0.006849
Selected Features by RFE:
['temperature_2m_mean', 'apparent_temperature_max', 'rain_sum', 'surface_pressure_min', 'vapour_pressure_deficit_max']

============================================================
Analyzing Target: soil_temperature_0_to_7cm_mean
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                        Feature  Importance
1            temperature_2m_max    0.778519
19  vapour_pressure_deficit_max    0.070758
4      apparent_temperature_max    0.031376
7    et0_fao_evapotranspiration    0.021642
18  wet_bulb_temperature_2m_min    0.015745
0           temperature_2m_mean    0.010050
6            wind_speed_10m_max    0.009869
8                      rain_sum    0.007562
2            temperature_2m_min    0.007029
9              dew_point_2m_max    0.006860
12         surface_pressure_min    0.005714
3     apparent_temperature_mean    0.005442
15     relative_humidity_2m_max    0.004697
5      apparent_temperature_min    0.004571
11         surface_pressure_max    0.004035
14             pressure_msl_min    0.003829
17  wet_bulb_temperature_2m_max    0.003684
10             dew_point_2m_min    0.003024
16     relative_humidity_2m_min    0.002981
13             pressure_msl_max    0.002614
Selected Features by RFE:
['temperature_2m_max', 'apparent_temperature_max', 'et0_fao_evapotranspiration', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max']

============================================================
Analyzing Target: wet_bulb_temperature_2m_min
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                           Feature  Importance
10                dew_point_2m_min    0.807159
17     wet_bulb_temperature_2m_max    0.151958
0              temperature_2m_mean    0.016057
2               temperature_2m_min    0.013240
1               temperature_2m_max    0.001397
15        relative_humidity_2m_max    0.001220
9                 dew_point_2m_max    0.001170
18     vapour_pressure_deficit_max    0.001149
5         apparent_temperature_min    0.001103
16        relative_humidity_2m_min    0.001088
19  soil_temperature_0_to_7cm_mean    0.000786
3        apparent_temperature_mean    0.000676
4         apparent_temperature_max    0.000601
7       et0_fao_evapotranspiration    0.000579
8                         rain_sum    0.000539
6               wind_speed_10m_max    0.000415
12            surface_pressure_min    0.000251
11            surface_pressure_max    0.000228
14                pressure_msl_min    0.000194
13                pressure_msl_max    0.000191
Selected Features by RFE:
['temperature_2m_mean', 'temperature_2m_min', 'dew_point_2m_min', 'wet_bulb_temperature_2m_max', 'vapour_pressure_deficit_max']

============================================================
Analyzing Target: wet_bulb_temperature_2m_max
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                           Feature  Importance
9                 dew_point_2m_max    0.954012
0              temperature_2m_mean    0.027019
1               temperature_2m_max    0.007559
10                dew_point_2m_min    0.001741
15        relative_humidity_2m_max    0.001664
19  soil_temperature_0_to_7cm_mean    0.001072
16        relative_humidity_2m_min    0.000934
17     wet_bulb_temperature_2m_min    0.000914
2               temperature_2m_min    0.000824
8                         rain_sum    0.000634
4         apparent_temperature_max    0.000547
7       et0_fao_evapotranspiration    0.000530
3        apparent_temperature_mean    0.000475
18     vapour_pressure_deficit_max    0.000469
5         apparent_temperature_min    0.000425
6               wind_speed_10m_max    0.000353
11            surface_pressure_max    0.000236
12            surface_pressure_min    0.000225
14                pressure_msl_min    0.000185
13                pressure_msl_max    0.000180
Selected Features by RFE:
['temperature_2m_mean', 'temperature_2m_max', 'dew_point_2m_max', 'dew_point_2m_min', 'relative_humidity_2m_max']

AGAIN, a potential resolution for such multicollinearity is to observe the feature importance/rank w.r.t. the target in question. After observing the level of importance or rank, to then observe the correlation heat map. For a high positive correlation pair (where the idea of "high" is subjective), to choose the feature of higher importance.

Analyzing How Weather Patterns Have Changed Over Time for a Particular Month Across Multiple Years¶

This analysis involves examining historical weather data for a specific month across several years. By studying variables such as temperature, precipitation, relative humidity, wind patterns, etc. Researchers can identify trends and changes in climate over time. This information is crucial for understanding long-term climate variability, predicting future weather patterns, and assessing the impacts of climate change.

This code to effectively filter the meteorological data for a specific month, calculates the average maximum temperature for that month across different years, and visualizes the results in a line plot. Code can be adjusted to analyze other months as needed.

Analysis of a particular variable's trend within a particular month across multiple years. To visualize the average maximum of a variable in question for the selected month over time, providing valuable insights into climate patterns and potential changes. The month of July is chosen because such period typically records the highest temperatures in the northern hemisphere; January also chosen because this period typically records the lowest temperatures related to the Earth's tilt w.r.t. its axis.

Case for Max Temperature 2 meters above ground in July:

In [57]:
# Extract year and month from the 'date' column using .loc to avoid SettingWithCopyWarning
daily_dataframe_part = daily_dataframe.copy()
daily_dataframe_part.loc[:, 'year'] = daily_dataframe_part['date'].dt.year
daily_dataframe_part.loc[:, 'month'] = daily_dataframe_part['date'].dt.month

# Filter for a specific month (e.g., July = 7)
specific_month = 7  # Change this to the month you want to analyze
monthly_data = daily_dataframe_part[daily_dataframe_part['month'] == specific_month]

# Group by year and calculate the mean of a specific variable, e.g., 'temperature_2m_max'
monthly_mean = monthly_data.groupby('year')['temperature_2m_max'].mean().reset_index()

# Plotting the results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_mean, x='year', y='temperature_2m_max', marker='o')  # This line is from seaborn
plt.title(f'Average Max Temperature in Month {specific_month} Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Max Temperature (°C)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

Case for Min Temperature 2 meters above ground in July:

In [59]:
# Ensure 'date' is a datetime object
daily_dataframe['date'] = pd.to_datetime(daily_dataframe['date'])

# Extract 'month' and 'year' from the 'date' column
daily_dataframe['month'] = daily_dataframe['date'].dt.month
daily_dataframe['year'] = daily_dataframe['date'].dt.year

# Now filter by month
specific_month = 7  # July
monthly_data = daily_dataframe[daily_dataframe['month'] == specific_month]

# Group by year and calculate the mean of temperature_2m_min
monthly_mean = monthly_data.groupby('year')['temperature_2m_min'].mean().reset_index()




# Plotting the results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_mean, x='year', y='temperature_2m_min', marker='o')  # This line is from seaborn
plt.title(f'Average Min Temperature in Month {specific_month} Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Min Temperature (°C)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

Case for Precipitation in July:

In [61]:
# Filter for a specific month (e.g., July = 7)
specific_month = 7  # Change this to the month you want to analyze
daily_dataframe = daily_dataframe[daily_dataframe['month'] == specific_month]

# Group by year and calculate the mean of 'rain_sum'
monthly_mean = monthly_data.groupby('year')['rain_sum'].mean().reset_index()

# Plotting the results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_mean, x='year', y='rain_sum', marker='o', color='blue')
plt.title(f'Average Rain Sum Per Day in Month {specific_month} Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Rain Sum Per Day')
plt.xticks(monthly_mean['year'], rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

Concerning the above scatter plots, for TMIN AND TMAX one should keep in mind that the average for each 1st month (January) of each year is considered, and not actual minimum of maximum. So, highly dynamic curves are not to be expected. A noticeable change in slope conveys that daily lower maximums or minimuns each year are increasing.

Statistical Method to Identify Significant Change in Climate¶

Statistical Significance

In statistics, well-known, the p-value (exhaustively applied) indicates whether there's statistical significance (whether difference or influence) in the segmentation of the variable(s) between the two periods. A p-value less than 0.05 suggests a significance (whether difference or influence).

Wilson Signed-Rank Test and Mann-Whitney U Test: A Comparative Overview¶

The Wilcoxon Signed-Rank Test and the Mann-Whitney U Test are two nonparametric statistical tests commonly used to compare the medians of two groups. These tests are particularly useful when the data does not meet the assumptions of parametric tests like the t-test, such as normality or homogeneity of variance.

Wilson Signed-Rank Test

The Wilcoxon Signed-Rank Test is used when the two groups being compared are paired or dependent (Hayes 2019). This means that each observation in one group corresponds to a specific observation in the other group. For example, it can be used to compare the pre- and post-treatment scores of the same individuals.

The test ranks the absolute differences between the paired observations and then sums the ranks of the differences that have the same sign. The resulting sum is compared to a critical value to determine if there is a significant difference between the medians of the two groups.

WSRT-HYPOTHESES --

A. Null Hypothesis: median difference between the paired observations is zero ($M_D = 0$).

$$H_0:M_D = 0$$

B. Alternative Hypothesis: median difference between the paired observations is not zero ($M_D \neq\, 0$). Such can be non-directional (as is in median difference not being 0); Directional, as in median difference being positive (say, if group 1 is greater than group 2); Directional, as in in median difference being negative (say, if group 1 is less than group 2).

C. *Test Statistic: calculate the signed ranks of the differences between paired observations, then sum the ranks for the positive and negative differences to obtain the test statistic.

D. Decision Rule: compare the test statistic to critical values from the Wilcoxon signed-rank table or use a p-value to determine significance. Reject $H_0$ if the p-value is less than the chosen significance level ($\alpha$).

STEP 1: Calculate the Paired Differences --

Given two related samples, $X = \{x_1,x_2,...,x_n\}$ and $Y = \{y_1,y_2,...,y_n\}$

$$D_i = x_i - y_i$$

where $i$ = 1,2,....n.

Ignore pairs where $D_i = 0$ (ties are removed).

STEP 2: Compute Absolute Differences and Ranks --

Compute the absolute differences:

$$|D_i|\,\,\text{for}\,\,D_i\,\neq\,0$$

Rank the absolute didfferences in ascending order. Assign average ranks for tied values.

STEP 3: Assign Signs to Ranks --

Restore the sign of $D_i$ to its rank:

$$R_i = \begin{cases} +\text{Rank}(|D_i|) & \text{if } D_i > 0, \\ -\text{Rank}(|D_i|) & \text{if } D_i < 0. \end{cases}$$

STEP 4: Compute the Taest Statistic --

Calculate the positive rank sum ($W^+$) and the negative rank sum ($W^-$):

$$RW^+ = \sum_{R_i > 0} R_i, \quad W^- = \sum_{R_i < 0} |R_i|$$

The test statistic $W$ is the smaller of the two:

$$W = \text{min}(W^+, W^-)$$

Mann-Whitney U Test

The Mann-Whitney U Test, also known as the Wilcoxon Rank-Sum Test, is used when the two groups being compared are independent (MacFaraland and Yates 2016). This means that there is no one-to-one correspondence between the observations in the two groups. For example, it can be used to compare the test scores of two different groups of students.

The test ranks all observations from both groups combined, then calculates the sum of the ranks for one of the groups. This sum is compared to a critical value to determine if there is a significant difference between the medians of the two groups.

MWUT-HYPOTHESES--

A. Null Hypothesis: the distributions of the two groups are equal.

$$H_0: F_X(t) = F_Y(t) \forall\,t$$

where $F_X(t)$ and $F_Y(t)$ are the cumulative distribution functions of the two populations.

B. Alternative Hypothesis: the distributions of the two groups are not equal. This can be directional (one-tailed) or non-directional (two-tailed); the former, simply establishing non-equivalence, while the latter being the greater than or less than case with distribution contrast.

$$H_1: F_X(t) \neq\,\, F_Y(t) \text{ for some}t$$

C. Test Statistic: Calculate the U statistic, which is based on the ranks of the combined data from both groups.

STEP 1: Combine and Rank the data --

Let $X = \{x_1,x_2,...,x_n\}$ and $Y = \{y_1,y_2,...,y_n\}$ represent the tow independent samples.

Combine the two samples into a single data set.

Rank all observations in ascending order, assigning averaged ranks to tied values.

STEP 2: Compute the Ranked Sums --

Compute the sum of ranks for each group:

$$R_X = \sum_{x \in X} \text{Rank}(x), \quad R_Y = \sum_{y \in Y} \text{Rank}(y)$$

STEP 3: Compute the U Statistic --

Calculate the U Statistic for each group:

$$U_X = R_X - \frac{n_X (n_X + 1)}{2}, \quad U_Y = R_Y - \frac{n_Y (n_Y + 1)}{2}$$

The two statistics are related:

$$U_x + U_Y = n_X\,n_Y$$

The test statistic U is the smaller of the two:

$$U = \text{min}(U_X,U_Y)$$

D. Compare the calculated U statistic to critical values from the Mann-Whitney U distribution table or use a p-value to determine significance. Reject the null hypothesis if the p-value is less than the chosen significance level ($\alpha$).

Considerations

The Wilcoxon Signed-Rank Test is typically used when you have paired observations, meaning that each data point in one period has a corresponding data point in the other period. For example, if you compare temperatures for the same month in two different years (e.g., January 1912 vs. January 1968), then the observations are paired.

If comparing aggregate measures (like average monthly temperatures) between two distinct periods without direct pairing of observations (i.e., you have one group of data for the earlier period and another group for the later period without matching), then the Wilcoxon Signed-Rank Test may not be appropriate. Instead, the Mann-Whitney U Test could be considered if you treat the two periods as independent groups. Both tests can be implemented with the appropriate specification, however, to implement the Mann-Whitney U Test, because the concern generally is weather data from different time periods.

When splitting the weather dataset into two periods (1912–1967 and 1968–2024) and compare them, the comparison is primarily focused on the attributes (e.g., temperature, precipitation, snowfall) within each period, rather than the years themselves. The reasoning:

  1. Attributes as the Basis of Comparison:

The aim of your analysis is to determine if there are statistically significant differences in the distribution, mean, median, or other characteristics of weather attributes between the two periods.

For instance, you may be interested in seeing if the average temperature or average precipitation levels have changed significantly from one period to the next.

The years themselves are just the framework for segmenting the data; the attributes (like temperature or precipitation) are the variables you are actually comparing.

  1. Years as Contextual Groupings:

By splitting the dataset based on years, you are essentially creating two groups or "batches" of data where the weather attributes are measured across different time frames.

The two time periods serve as the independent grouping factor, and you are testing whether the attributes (temperature, precipitation, etc.) exhibit different patterns between these two periods.

  1. Temporal Influence and Aggregation:

Weather data is inherently time-dependent, and by aggregating the data within each period, you are accounting for the overall trend or changes that might have happened over those years.

The comparison reflects how the overall weather patterns or averages of each attribute differ between these long-term periods, rather than focusing on year-to-year variations.

In [64]:
from scipy.stats import mannwhitneyu
ddy = daily_dataframe.copy()

period1 = ddy.loc['1980':'2002']
period2 = ddy.loc['2003':'2025']

Histogram of Period 1 Attributes¶

In [66]:
# Get the column names
column_names = period1.columns
print(column_names)
column_names_list = column_names.tolist()

# Calculating the number of ros and columns for subplots.
num_cols = 3  # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
     # Calculating the number of rows

# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))

# Flatten if required.
if num_rows > 1:
  axes = axes.flatten()

# Plot the histograms
for i, col in enumerate(column_names_list):
  sns.histplot(data = period1[col], ax = axes[i], kde = True)
  axes[i].set_title(f'Histogram of Period 1 {col}')
  axes[i].set_xlabel('Value')
  axes[i].set_ylabel('Frequency')
  axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
       'temperature_2m_min', 'apparent_temperature_mean',
       'apparent_temperature_max', 'apparent_temperature_min',
       'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
       'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
       'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
       'relative_humidity_2m_max', 'relative_humidity_2m_min',
       'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
       'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean',
       'month', 'year'],
      dtype='object')
No description has been provided for this image

Histogram of Period 2 Attributes¶

In [68]:
# Get the column names
column_names = period2.columns
print(column_names)
column_names_list = column_names.tolist()

# Calculating the number of ros and columns for subplots.
num_cols = 3  # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
     # Calculating the number of rows

# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))

# Flatten if required.
if num_rows > 1:
  axes = axes.flatten()

# Plot the histograms
for i, col in enumerate(column_names_list):
  sns.histplot(data = period2[col], ax = axes[i], kde = True)
  axes[i].set_title(f'Histogram of Period 2 {col}')
  axes[i].set_xlabel('Value')
  axes[i].set_ylabel('Frequency')
  axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
       'temperature_2m_min', 'apparent_temperature_mean',
       'apparent_temperature_max', 'apparent_temperature_min',
       'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
       'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
       'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
       'relative_humidity_2m_max', 'relative_humidity_2m_min',
       'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
       'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean',
       'month', 'year'],
      dtype='object')
No description has been provided for this image

Mann-Whitney Test for Differences:

In [70]:
# List of columns to compare
import numpy as np
columns_to_compare = ['temperature_2m_max', 'temperature_2m_min',
                      'wind_speed_10m_max', 'rain_sum',
                      'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max',
                      'soil_temperature_0_to_7cm_mean']

# Loop through each column and perform the Mann-Whitney U Test
for column in columns_to_compare:
    stat, p_value = mannwhitneyu(period1[column], period2[column])
    Mean_Period_1 = np.mean(period1[column])
    Mean_Period_2 = np.mean(period2[column])
    print(f'Column: {column}')
    print(f'Mann-Whitney U Test Statistic: {stat}')
    print(f'p-value: {p_value}')
    print(f'Mean_Period_1: {Mean_Period_1}')
    print(f'Mean_Period_2: {Mean_Period_2}\n')

    # Check significance
    if p_value < 0.05:
        print(f"There is a significant difference in {column} between the two periods.")
        # Compare medians to determine which period is elevated
        if Mean_Period_1 < Mean_Period_2:
            print(f"{column} is elevated in Period 2.")
        else:
            print(f"{column} is elevated in Period 1.")
    else:
        print(f"There is no significant difference in {column} between the two periods.\n")
    print(f'Next column to be evaluated:\n')
Column: temperature_2m_max
Mann-Whitney U Test Statistic: 186.5
p-value: 0.6135340544650532
Mean_Period_1: 25.39150047302246
Mean_Period_2: 25.283384323120117

There is no significant difference in temperature_2m_max between the two periods.

Next column to be evaluated:

Column: temperature_2m_min
Mann-Whitney U Test Statistic: 168.5
p-value: 0.8253449727843322
Mean_Period_1: 23.866498947143555
Mean_Period_2: 23.766822814941406

There is no significant difference in temperature_2m_min between the two periods.

Next column to be evaluated:

Column: wind_speed_10m_max
Mann-Whitney U Test Statistic: 34.0
p-value: 0.059767510390904895
Mean_Period_1: 30.130117416381836
Mean_Period_2: 35.02366256713867

There is no significant difference in wind_speed_10m_max between the two periods.

Next column to be evaluated:

Column: rain_sum
Mann-Whitney U Test Statistic: 114.0
p-value: 0.5331918559410417
Mean_Period_1: 0.6499999761581421
Mean_Period_2: 2.3805196285247803

There is no significant difference in rain_sum between the two periods.

Next column to be evaluated:

Column: relative_humidity_2m_max
Mann-Whitney U Test Statistic: 129.0
p-value: 0.699529996606967
Mean_Period_1: 87.6551284790039
Mean_Period_2: 87.71660614013672

There is no significant difference in relative_humidity_2m_max between the two periods.

Next column to be evaluated:

Column: wet_bulb_temperature_2m_max
Mann-Whitney U Test Statistic: 153.0
p-value: 0.9937154462383508
Mean_Period_1: 22.607219696044922
Mean_Period_2: 22.634029388427734

There is no significant difference in wet_bulb_temperature_2m_max between the two periods.

Next column to be evaluated:

Column: soil_temperature_0_to_7cm_mean
Mann-Whitney U Test Statistic: 198.0
p-value: 0.49314501524676035
Mean_Period_1: 25.866500854492188
Mean_Period_2: 25.726337432861328

There is no significant difference in soil_temperature_0_to_7cm_mean between the two periods.

Next column to be evaluated:

Permutation Tests¶

Permutation tests are a non-parametric method suitable for comparing two groups without assuming normality.

Null Hypothesis: there is no significant difference in the mean values of the specified attribute (e.g., PRCP, SNOW, etc.) between the two periods.

Alternative Hypothesis: there is a significant difference in the mean values of the specified attribute between the two periods.

Test Statistic: the difference in the means of the attribute values between the two periods

Let Let $X = \{x_1,x_2,...,x_n\}$ and $Y = \{y_1,y_2,...,y_n\}$ represent the tow independent samples.

Define a test statistic $T(X,Y)$ which measures the difference between the two groups. Will consider the mean:

$$T(X,Y) = \bar{X}-\bar{Y}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively,

Permutation Distribution

STEP 1: Combine the Data --

$$Z = X\,\cup\,Y$$

STEP 2: Generate Permutations --

Randomly shuffle the combined dataset $Z$ to generate all possible permutations (or a large subset of permutations in practice due to computational constraints). Express a single permutation as $Z^*$, which is then split into two groups:

$$Z^* = {X^*,Y^*}$$

where $X^*$ and $Y^*$ have the same sample sizes as the original $X$ and $Y$.

STEP 3: Compute the Test Statistic for each Permutation --

For each permutation $Z^*$, compute the test statistic:

$$T^* = T(X^*,Y^*)$$

STEP 4: Construct the Permutation Distribution --

The set of test statistics across all permutations forms the permutation distribution:

$$\{T_1^*, T_2^*, \dots, T_k^*\}$$

where $k$ is the total number of permutations.

P-value Calculation

The p-value to serve as the proportion of premutation test statistics that are extreme or more extreme than the observed test statistic, say, $T_{obs} = T(X,Y)$:

$$p = \frac{\#\{T^* \geq T_{\text{obs}}\}}{\text{Total permutations}}$$

For a two-tailed test:

$$p = \frac{\#\{|T^*| \geq |T_{\text{obs}}|\}}{\text{Total permutations}}$$

Significance Level: the standard 0.05 for alpha.

Decision Rule: if p-value is less than 0.05, then reject the null hypothesis; thus identifying significant difference in the respective attribute between the two periods. Else, no significant difference between the two periods.

Such to be implemented, serving as a "second opinion" model. The implementation:

In [72]:
from scipy.stats import permutation_test

# Sample data
# period1_data and period2_data should be the data for each period (e.g., temperature values)

def test_permutation(column):
    # Define the test statistic as the difference in means
    test_statistic = lambda x, y: x.mean() - y.mean()

    # Perform the permutation test
    result = permutation_test((period1[column], period2[column]), 
                              test_statistic, 
                              alternative='two-sided', 
                              n_resamples=10000, 
                              random_state=42)

    # Print the p-value
    p_value = result.pvalue
    print(f"{column} - p-value: {p_value:.4f}")

    # Check if the p-value indicates a significant difference
    if p_value < 0.05:
        print(f"Significant difference detected in {column} between the two periods.")
    else:
        print(f"No significant difference detected in {column} between the two periods.")

# Columns to test
columns = ['temperature_2m_max', 'temperature_2m_min',
                      'wind_speed_10m_max', 'rain_sum',
                      'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max',
                      'soil_temperature_0_to_7cm_mean']

# Loop through each column and run the tes
for col in columns:
    test_permutation(col)
temperature_2m_max - p-value: 0.7633
No significant difference detected in temperature_2m_max between the two periods.
temperature_2m_min - p-value: 0.8561
No significant difference detected in temperature_2m_min between the two periods.
wind_speed_10m_max - p-value: 0.0774
No significant difference detected in wind_speed_10m_max between the two periods.
rain_sum - p-value: 0.4678
No significant difference detected in rain_sum between the two periods.
relative_humidity_2m_max - p-value: 0.9007
No significant difference detected in relative_humidity_2m_max between the two periods.
wet_bulb_temperature_2m_max - p-value: 0.9077
No significant difference detected in wet_bulb_temperature_2m_max between the two periods.
soil_temperature_0_to_7cm_mean - p-value: 0.5735
No significant difference detected in soil_temperature_0_to_7cm_mean between the two periods.

Extreme Value Analysis¶

To now resort to reapplying the data ranging from year 1869 to year 2022; however due to being cleaned it's shortened. Being a historical daily meteorological data set for Central Park, New York. Such data is a Kaggle data set, namely, "New York City Weather: A 154 year Retrospective". Such daily data was not initially applied to time series analysis because too much instances are missing towards performing decent time series analysis (including cointegration analysis). However, such data set can be adequate for Extreme Value Analysis (EVA).

Extreme value analysis (EVA) is essential in climate science to study rare and extreme climate events, such as heatwaves, cold spells, floods, droughts, or storms. These events have significant impacts, and EVA provides a foundation to quantify their frequency, magnitude, and associated risks.

Steps for Extreme Value Analysis in Climate Data:

  1. Data Preparation --

Choose the Variable of Interest: Common climate variables include temperature, precipitation, wind speed, sea level, etc.

Filter the Data for Extremes: Focus on the most relevant extremes. For example: High extremes: Heatwaves (extremely high temperatures over a decent period).

Low extremes: Cold spells (extremely low temperatures).

Seasonal Adjustment: Climate data often has strong seasonal trends.

De-seasonalize the data by removing the seasonal component if necessary.

  1. Select an Extreme Value Model

The GENERALIZED EXTREME VALUE (GEV) distribution is a key concept in extreme value theory (EVT), which deals with the statistical modeling of extreme deviations from the median of probability distributions. The GEV distribution is used to model the behavior of the maximum (or minimum) of a large number of random variables, and it arises naturally when considering the limiting distribution of block maxima.

The GEV distribution combines three different types of distributions that arise in extreme value theory:

Gumbel distribution (Type I) – Models light-tailed extremes (e.g., normal or exponential distributions).

Fréchet distribution (Type II) – Models heavy-tailed extremes (e.g., power-law behavior like Pareto distributions).

Weibull distribution (Type III) – Models bounded upper extremes (e.g., distributions that have an upper limit).

These three distributions are unified under the GEV through a shape parameter, $\xi$, which determines the type:

$\xi$ = 0 (Gumbel)

$\xi$ > 0 (Fréchet)

$\xi$ < 0 (Weibull)

The GEV distribution is parameterized by three values:

$\mu$ (location parameter): Determines where the distribution is centered.

$\sigma$ (scale parameter): Controls the spread or scale of the distribution.

$\xi$ (shape parameter): Determines the tail behavior, distinguishing between Gumbel, Fréchet, and Weibull types.

The cumulative distribution function (CDF) of the GEV is given by:

$$F(x; \mu, \sigma, \xi) = \exp \left\{ - \left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{- \frac{1}{\xi}} \right\}, \quad \text{for} \quad 1 + \xi \left( \frac{x - \mu}{\sigma} \right) > 0$$

Fits the Generalized Extreme Value (GEV) distribution to the block maxima. GEV combines three families of extreme value distributions: Gumbel, Frechet, and Weibull.

From the National Aeronautics and Space Administration (NASA) the resulting probability distribution function (PDF) for two category of shape parameter (i.e., whether it is equal to zero or not) is

$$\frac{1}{\sigma}t(x)^{x+1}e^{-t(x)}$$

where

$$t(x) = \begin{cases} (1 + \xi \frac{x-\mu}{\sigma})^{-1/\xi}, & \text{if } \xi \neq 0 \\ e^{-(x-\mu)/\sigma}, & \text{if } \xi = 0 \end{cases}$$

Python code to plot the three types of GEV densities:

In [74]:
from scipy.stats import genextreme as gev
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set Seaborn style for pretty plots
sns.set(style="whitegrid")

# Define parameters for the three types of GEV distributions
params = {
    "Gumbel (Type I)": {'shape': 0, 'loc': 0, 'scale': 1},
    "Frechet (Type II)": {'shape': 0.5, 'loc': 0, 'scale': 1},
    "Weibull (Type III)": {'shape': -0.5, 'loc': 0, 'scale': 1}
}

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot GEV densities for each type
x = np.linspace(-5, 5, 1000)  # Common range for all GEV types
colors = ['coral', 'skyblue', 'limegreen']

for i, (label, param) in enumerate(params.items()):
    # Extract shape, loc, and scale
    shape = param['shape']
    loc = param['loc']
    scale = param['scale']
    
    # Generate GEV PDF
    pdf = gev.pdf(x, shape, loc=loc, scale=scale)
    
    # Plot the PDF
    ax.plot(x, pdf, label=f'{label}', color=colors[i], lw=2)

# Add labels and title
ax.set_title("GEV Distributions (Type I: Gumbel, Type II: Frechet, Type III: Weibull)", fontsize=16)
ax.set_xlabel("Value", fontsize=12)
ax.set_ylabel("Density", fontsize=12)

# Add legend
ax.legend(loc='upper right')

# Show the plot
plt.show()
No description has been provided for this image

There are two main approaches to extreme value modeling:

Block Maxima Approach --

Break your data into blocks (e.g., yearly or monthly) and only take the maximum (or minimum) value from each block.

Peak Over Threshold (POT) Approach --

Define a threshold (high quantile) and analyze values exceeding this threshold. Fits the Generalized Pareto Distribution (GPD) to exceedances over the threshold. The choice of threshold is critical; it should be high enough to capture extremes without being too high to leave out too many data points.The probability density function (PDF) of the Generalized Pareto Distribution (GPD) is given by:

$$f(x) = \begin{cases} \frac{1}{\sigma}\left(1 + \frac{\xi(x-\mu)}{\sigma}\right)^{-(1+\xi)^{-1}}, & \text{if } \xi \neq 0 \\ \frac{1}{\sigma}e^{-(x-\mu)/\sigma}, & \text{if } \xi = 0 \end{cases}$$

The Generalized Pareto Distribution (GPD) is a flexible distribution often used to model extreme value phenomena. Its probability density function (PDF) is characterized by three parameters: $\mu$, $\sigma$, and $\xi$.

In [76]:
from scipy.stats import genpareto

# Parameters for the Generalized Pareto Distribution
shape_param = 0.5  # ξ (shape parameter)
scale_param = 1.0  # σ (scale parameter)
loc_param = 0.0    # μ (location parameter)

# Generate x values
x = np.linspace(0, 10, 1000)

# PDF and CDF using scipy's genpareto
pdf_values = genpareto.pdf(x, c=shape_param, loc=loc_param, scale=scale_param)
cdf_values = genpareto.cdf(x, c=shape_param, loc=loc_param, scale=scale_param)

# Plotting the PDF
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, pdf_values, label='PDF')
plt.title('Generalized Pareto Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Density')
plt.grid(True)
plt.legend()

# Plotting the CDF
plt.subplot(1, 2, 2)
plt.plot(x, cdf_values, label='CDF', color='orange')
plt.title('Generalized Pareto Distribution (CDF)')
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()
No description has been provided for this image
  1. The functions of the parameters both in GEV and GPD have the same function. Fit the Distribution GEV (Generalized Extreme Value) Distribution: In the block maxima method, you can fit the GEV distribution using Maximum Likelihood Estimation (MLE). The GEV distribution has three parameters –

Location ($\mu$): Determines the center of the distribution.

Scale ($\sigma$): Determines the spread.

Shape ($\xi$): Governs the tail behavior (whether it is heavy, light, or bounded).

GPD (Generalized Pareto Distribution). In the POT approach, fit the GPD to the excesses above the chosen threshold. The GPD distribution also has shape, scale, and threshold parameters.

  1. The return level is the value that is expected to be exceeded once on average every T years. This is crucial for risk assessment:

Return Level: For the GEV distribution, the $T$-year return level $z_T$ is computed by --

$$z_T = \mu + \frac{\xi}{\sigma} \left[ \left( -\ln\left(1 - \frac{1}{T}\right) \right)^{-\xi} - 1 \right]$$

For the GPD distribution, return levels are computed based on the scale and shape parameters for exceedances.

  1. Diagnostics and Model Checking After fitting the model, it’s important to check the fit:

Quantile-Quantile (Q-Q) Plots -- Check whether the fitted distribution matches the observed extremes.

Return Level Plot -- Plot return levels against the return periods. This helps validate that your model accurately predicts extreme events for longer return periods.

Residual Analysis -- Analyze residuals to see if they show any patterns (residuals should be randomly distributed).

  1. Interpreting the Results

Return Period -- The expected number of years between extreme events of a certain magnitude. For example, the "100-year event" refers to an event that has a 1% chance of occurring in any given year.

Probability of Exceedance -- The likelihood that an extreme event will exceed a given threshold in a particular year.

Considerations in Climate Data EVA --

  1. Stationarity: Many climate datasets are not stationary due to long-term trends (e.g., rising temperatures due to global warming). It may be necessary to de-trend the data before performing EVA.

  2. Seasonality: Climate data is highly seasonal. You may need to separate the extremes by season or adjust for seasonality.

  3. Dependence: Climate events may be temporally dependent. If extremes are clustered (e.g., heatwaves during summer), you may need to adjust for this dependence.

Block-Maxima Approach:

For the following, there's computing the temperature level that is expected to occur, on avaregage, once evey 100 years. The return period $T$ represents the average number of years you would expected between events that exceed a particular extreme value.

The return level, return_level(T) function computes the quantile associated with $1 - \frac{1}{T}$ for A GEV distribution:

$$\text{return level} = \text{GEV}.ppf(1-\frac{1}{T}, \text{shape, location, scale})$$

When $T = 100$, this quantile corresponds to $1 - \frac{1}{100} = 0.99$, conveying that there's a 99% probability that this extreme this extreeme value will not be reached in any given year; equivalently, a 1% probability it will. By setting $T = 100$ , there's observation of the $99$-percentile of extreme values (whether daily, monthly or yearly, depending on the applied block size). This percentile imples a 1% probability of occurence in any given year, being consistent with a 100-year event threshold.

Reloading the data to avoid "date" column issues:

In [81]:
import openmeteo_requests

import pandas as pd
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
	"latitude": 16.7425,
	"longitude": -62.1874,
	"start_date": "1980-01-08",
	"end_date": "2025-06-24",
	"daily": ["temperature_2m_mean", "temperature_2m_max", "temperature_2m_min", "apparent_temperature_mean", "apparent_temperature_max", "apparent_temperature_min", "wind_speed_10m_max", "et0_fao_evapotranspiration", "rain_sum", "dew_point_2m_max", "dew_point_2m_min", "surface_pressure_max", "surface_pressure_min", "pressure_msl_max", "pressure_msl_min", "relative_humidity_2m_max", "relative_humidity_2m_min", "wet_bulb_temperature_2m_max", "wet_bulb_temperature_2m_min", "vapour_pressure_deficit_max", "soil_temperature_0_to_7cm_mean"],
	"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process daily data. The order of variables needs to be the same as requested.
daily = response.Daily()
daily_temperature_2m_mean = daily.Variables(0).ValuesAsNumpy()
daily_temperature_2m_max = daily.Variables(1).ValuesAsNumpy()
daily_temperature_2m_min = daily.Variables(2).ValuesAsNumpy()
daily_apparent_temperature_mean = daily.Variables(3).ValuesAsNumpy()
daily_apparent_temperature_max = daily.Variables(4).ValuesAsNumpy()
daily_apparent_temperature_min = daily.Variables(5).ValuesAsNumpy()
daily_wind_speed_10m_max = daily.Variables(6).ValuesAsNumpy()
daily_et0_fao_evapotranspiration = daily.Variables(7).ValuesAsNumpy()
daily_rain_sum = daily.Variables(8).ValuesAsNumpy()
daily_dew_point_2m_max = daily.Variables(9).ValuesAsNumpy()
daily_dew_point_2m_min = daily.Variables(10).ValuesAsNumpy()
daily_surface_pressure_max = daily.Variables(11).ValuesAsNumpy()
daily_surface_pressure_min = daily.Variables(12).ValuesAsNumpy()
daily_pressure_msl_max = daily.Variables(13).ValuesAsNumpy()
daily_pressure_msl_min = daily.Variables(14).ValuesAsNumpy()
daily_relative_humidity_2m_max = daily.Variables(15).ValuesAsNumpy()
daily_relative_humidity_2m_min = daily.Variables(16).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_max = daily.Variables(17).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_min = daily.Variables(18).ValuesAsNumpy()
daily_vapour_pressure_deficit_max = daily.Variables(19).ValuesAsNumpy()
daily_soil_temperature_0_to_7cm_mean = daily.Variables(20).ValuesAsNumpy()

daily_data = {"date": pd.date_range(
	start = pd.to_datetime(daily.Time(), unit = "s", utc = True),
	end = pd.to_datetime(daily.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = daily.Interval()),
	inclusive = "left"
)}

daily_data["temperature_2m_mean"] = daily_temperature_2m_mean
daily_data["temperature_2m_max"] = daily_temperature_2m_max
daily_data["temperature_2m_min"] = daily_temperature_2m_min
daily_data["apparent_temperature_mean"] = daily_apparent_temperature_mean
daily_data["apparent_temperature_max"] = daily_apparent_temperature_max
daily_data["apparent_temperature_min"] = daily_apparent_temperature_min
daily_data["wind_speed_10m_max"] = daily_wind_speed_10m_max
daily_data["et0_fao_evapotranspiration"] = daily_et0_fao_evapotranspiration
daily_data["rain_sum"] = daily_rain_sum
daily_data["dew_point_2m_max"] = daily_dew_point_2m_max
daily_data["dew_point_2m_min"] = daily_dew_point_2m_min
daily_data["surface_pressure_max"] = daily_surface_pressure_max
daily_data["surface_pressure_min"] = daily_surface_pressure_min
daily_data["pressure_msl_max"] = daily_pressure_msl_max
daily_data["pressure_msl_min"] = daily_pressure_msl_min
daily_data["relative_humidity_2m_max"] = daily_relative_humidity_2m_max
daily_data["relative_humidity_2m_min"] = daily_relative_humidity_2m_min
daily_data["wet_bulb_temperature_2m_max"] = daily_wet_bulb_temperature_2m_max
daily_data["wet_bulb_temperature_2m_min"] = daily_wet_bulb_temperature_2m_min
daily_data["vapour_pressure_deficit_max"] = daily_vapour_pressure_deficit_max
daily_data["soil_temperature_0_to_7cm_mean"] = daily_soil_temperature_0_to_7cm_mean

daily_dataframe = pd.DataFrame(data = daily_data)
print(daily_dataframe)
daily_dataframe = daily_dataframe.copy()
goody_frame = daily_dataframe.dropna()
goody_frame.info()
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
                           date  temperature_2m_mean  temperature_2m_max  \
0     1980-01-08 04:00:00+00:00            23.374834           24.141499   
1     1980-01-09 04:00:00+00:00            23.264421           23.891499   
2     1980-01-10 04:00:00+00:00            22.322748           23.191502   
3     1980-01-11 04:00:00+00:00            22.587332           23.341499   
4     1980-01-12 04:00:00+00:00            21.306086           22.091499   
...                         ...                  ...                 ...   
16600 2025-06-20 04:00:00+00:00            25.351082           26.199001   
16601 2025-06-21 04:00:00+00:00            25.390665           25.898998   
16602 2025-06-22 04:00:00+00:00            25.317749           25.898998   
16603 2025-06-23 04:00:00+00:00                  NaN           25.848999   
16604 2025-06-24 04:00:00+00:00                  NaN                 NaN   

       temperature_2m_min  apparent_temperature_mean  \
0               22.191502                  22.092840   
1               22.191502                  22.358231   
2               21.341499                  21.067259   
3               21.841499                  19.905577   
4               20.541500                  19.145449   
...                   ...                        ...   
16600           24.848999                  25.104864   
16601           24.699001                  25.419016   
16602           24.449001                  24.848602   
16603           25.098999                        NaN   
16604                 NaN                        NaN   

       apparent_temperature_max  apparent_temperature_min  wind_speed_10m_max  \
0                     23.520189                 20.983297           37.212578   
1                     23.697132                 21.602598           36.896046   
2                     22.371422                 19.988932           35.654541   
3                     20.436180                 18.984425           42.072281   
4                     19.637054                 18.262983           40.104061   
...                         ...                       ...                 ...   
16600                 27.231419                 23.766788           40.882591   
16601                 27.573139                 24.278919           38.166790   
16602                 26.219694                 23.004978           44.039349   
16603                 25.357843                 23.626095           42.990990   
16604                       NaN                       NaN                 NaN   

       et0_fao_evapotranspiration  rain_sum  ...  surface_pressure_max  \
0                        3.982460       1.5  ...            983.794922   
1                        3.946293       0.8  ...            984.397400   
2                        3.259691       2.7  ...            983.913513   
3                        4.604709       0.5  ...            983.572449   
4                        2.766571       5.7  ...            982.082092   
...                           ...       ...  ...                   ...   
16600                    4.981394       0.1  ...            983.506775   
16601                    5.119689       0.0  ...            983.344971   
16602                    5.130907       1.0  ...            982.319397   
16603                         NaN       NaN  ...            981.898865   
16604                         NaN       NaN  ...                   NaN   

       surface_pressure_min  pressure_msl_max  pressure_msl_min  \
0                980.577454       1019.299988       1016.099976   
1                981.443359       1019.900024       1016.900024   
2                980.805786       1019.599976       1016.299988   
3                980.355164       1019.099976       1015.900024   
4                978.976501       1017.799988       1014.599976   
...                     ...               ...               ...   
16600            981.255981       1018.700012       1016.500000   
16601            980.240479       1018.700012       1015.400024   
16602            979.411743       1017.500000       1014.500000   
16603            979.643860       1017.099976       1014.799988   
16604                   NaN               NaN               NaN   

       relative_humidity_2m_max  relative_humidity_2m_min  \
0                     87.652779                 70.725937   
1                     87.906815                 73.156029   
2                     90.619431                 71.578697   
3                     81.800613                 61.149487   
4                     89.427284                 78.321884   
...                         ...                       ...   
16600                 86.541199                 70.866669   
16601                 85.219734                 72.591751   
16602                 86.767601                 72.591751   
16603                 84.229759                 75.320984   
16604                       NaN                       NaN   

       wet_bulb_temperature_2m_max  wet_bulb_temperature_2m_min  \
0                        21.027277                    20.169138   
1                        20.914402                    20.337797   
2                        20.636232                    18.998484   
3                        19.724335                    17.843048   
4                        19.959215                    19.202456   
...                            ...                          ...   
16600                    23.118631                    21.683819   
16601                    22.751518                    22.099451   
16602                    22.906918                    21.904879   
16603                    23.149427                    22.411777   
16604                          NaN                          NaN   

       vapour_pressure_deficit_max  soil_temperature_0_to_7cm_mean  
0                         0.880710                       24.816500  
1                         0.795568                       24.729010  
2                         0.783625                       24.678999  
3                         1.107534                       24.629000  
4                         0.576288                       24.578997  
...                            ...                             ...  
16600                     0.984500                       26.217749  
16601                     0.912614                       26.238586  
16602                     0.912614                       26.267754  
16603                     0.821694                             NaN  
16604                          NaN                             NaN  

[16605 rows x 22 columns]
<class 'pandas.core.frame.DataFrame'>
Index: 16603 entries, 0 to 16602
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype              
---  ------                          --------------  -----              
 0   date                            16603 non-null  datetime64[ns, UTC]
 1   temperature_2m_mean             16603 non-null  float32            
 2   temperature_2m_max              16603 non-null  float32            
 3   temperature_2m_min              16603 non-null  float32            
 4   apparent_temperature_mean       16603 non-null  float32            
 5   apparent_temperature_max        16603 non-null  float32            
 6   apparent_temperature_min        16603 non-null  float32            
 7   wind_speed_10m_max              16603 non-null  float32            
 8   et0_fao_evapotranspiration      16603 non-null  float32            
 9   rain_sum                        16603 non-null  float32            
 10  dew_point_2m_max                16603 non-null  float32            
 11  dew_point_2m_min                16603 non-null  float32            
 12  surface_pressure_max            16603 non-null  float32            
 13  surface_pressure_min            16603 non-null  float32            
 14  pressure_msl_max                16603 non-null  float32            
 15  pressure_msl_min                16603 non-null  float32            
 16  relative_humidity_2m_max        16603 non-null  float32            
 17  relative_humidity_2m_min        16603 non-null  float32            
 18  wet_bulb_temperature_2m_max     16603 non-null  float32            
 19  wet_bulb_temperature_2m_min     16603 non-null  float32            
 20  vapour_pressure_deficit_max     16603 non-null  float32            
 21  soil_temperature_0_to_7cm_mean  16603 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(21)
memory usage: 1.6 MB

EVA computation:

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import genextreme, genpareto

# --- Prepare data ---
goody_frame['date'] = pd.to_datetime(goody_frame['date'])
goody_frame = goody_frame.set_index('date').sort_index()

# --- Constants ---
T = 100  # Return period in years
threshold_quantile = 0.95
years = (goody_frame.index.max() - goody_frame.index.min()).days / 365.25

# --- Loop over variables ---
for col in goody_frame.columns:
    print(f"\n==========================")
    print(f"📈 Analyzing variable: {col}")
    print(f"==========================")

    data = goody_frame[col].dropna()

    # -----------------------------
    # BLOCK MAXIMA + GEV
    # -----------------------------
    block_max = data.resample('YE').max().dropna()
    c, loc_gev, scale_gev = genextreme.fit(block_max)

    # Return Level for 100-year event (GEV)
    if c != 0:
        z_gev = loc_gev + (scale_gev / c) * ((-np.log(1 - 1/T))**(-c) - 1)
    else:
        z_gev = loc_gev - scale_gev * np.log(-np.log(1 - 1/T))

    # Plot GEV
    x_gev = np.linspace(block_max.min(), block_max.max(), 100)
    pdf_gev = genextreme.pdf(x_gev, c, loc=loc_gev, scale=scale_gev)

    plt.figure(figsize=(10, 4))
    plt.hist(block_max, bins=10, density=True, alpha=0.5, label='Block Maxima')
    plt.plot(x_gev, pdf_gev, 'r-', label='GEV Fit')
    plt.axvline(z_gev, color='k', linestyle='--', label=f'100-yr RL = {z_gev:.2f}')
    plt.title(f"{col} - GEV (Block Maxima)")
    plt.xlabel(col)
    plt.ylabel("Density")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # -----------------------------
    # POT + GPD
    # -----------------------------
    threshold = data.quantile(threshold_quantile)
    exceedances = data[data > threshold] - threshold
    exceedances = exceedances.dropna()

    if exceedances.empty:
        print("⚠️ No exceedances above threshold — skipping POT analysis.")
        continue

    shape_gpd, loc_gpd, scale_gpd = genpareto.fit(exceedances)
    num_exceed = exceedances.shape[0]
    n = num_exceed / years  # exceedances per year

    # Return Level for 100-year event (GPD)
    if shape_gpd != 0:
        z_gpd = threshold + (scale_gpd / shape_gpd) * ((T * n)**shape_gpd - 1)
    else:
        z_gpd = threshold + scale_gpd * np.log(T * n)

    # Plot GPD
    x_gpd = np.linspace(0, exceedances.max(), 100)
    pdf_gpd = genpareto.pdf(x_gpd, shape_gpd, loc=loc_gpd, scale=scale_gpd)

    plt.figure(figsize=(10, 4))
    plt.hist(exceedances, bins=20, density=True, alpha=0.5, label='Exceedances')
    plt.plot(x_gpd, pdf_gpd, 'r-', label='GPD Fit')
    plt.axvline(z_gpd - threshold, color='k', linestyle='--', label=f'100-yr RL = {z_gpd:.2f}')
    plt.title(f"{col} - GPD (Peaks Over Threshold)")
    plt.xlabel(f"{col} exceedances over {threshold:.2f}")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # -----------------------------
    # Summary Printout
    # -----------------------------
    print("📊 GEV Fit:")
    print(f"  Shape:     {c:.4f}")
    print(f"  Location:  {loc_gev:.2f}")
    print(f"  Scale:     {scale_gev:.2f}")
    print(f"  🎯 100-year Return Level (GEV): {z_gev:.2f}")

    print("\n📊 GPD Fit:")
    print(f"  Threshold: {threshold:.2f}")
    print(f"  Shape:     {shape_gpd:.4f}")
    print(f"  Location:  {loc_gpd:.2f}")
    print(f"  Scale:     {scale_gpd:.2f}")
    print(f"  🎯 100-year Return Level (GPD): {z_gpd:.2f}")
C:\Users\verlene\AppData\Local\Temp\ipykernel_10952\2125509235.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  goody_frame['date'] = pd.to_datetime(goody_frame['date'])
==========================
📈 Analyzing variable: temperature_2m_mean
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.0523
  Location:  26.01
  Scale:     0.50
  🎯 100-year Return Level (GEV): 28.62

📊 GPD Fit:
  Threshold: 26.04
  Shape:     -0.2914
  Location:  0.00
  Scale:     0.57
  🎯 100-year Return Level (GPD): 27.77

==========================
📈 Analyzing variable: temperature_2m_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     -0.3138
  Location:  26.50
  Scale:     0.58
  🎯 100-year Return Level (GEV): 27.90

📊 GPD Fit:
  Threshold: 27.55
  Shape:     -0.3587
  Location:  0.05
  Scale:     0.80
  🎯 100-year Return Level (GPD): 29.63

==========================
📈 Analyzing variable: temperature_2m_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.1078
  Location:  25.53
  Scale:     0.41
  🎯 100-year Return Level (GEV): 27.99

📊 GPD Fit:
  Threshold: 25.10
  Shape:     0.0217
  Location:  0.04
  Scale:     0.37
  🎯 100-year Return Level (GPD): 28.15

==========================
📈 Analyzing variable: apparent_temperature_mean
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.0894
  Location:  29.02
  Scale:     0.78
  🎯 100-year Return Level (GEV): 33.45

📊 GPD Fit:
  Threshold: 28.06
  Shape:     -0.1756
  Location:  0.00
  Scale:     0.90
  🎯 100-year Return Level (GPD): 31.81

==========================
📈 Analyzing variable: apparent_temperature_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.2380
  Location:  31.57
  Scale:     1.16
  🎯 100-year Return Level (GEV): 41.24

📊 GPD Fit:
  Threshold: 30.48
  Shape:     -0.2249
  Location:  0.00
  Scale:     1.16
  🎯 100-year Return Level (GPD): 34.70

==========================
📈 Analyzing variable: apparent_temperature_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.0871
  Location:  27.74
  Scale:     0.76
  🎯 100-year Return Level (GEV): 32.07

📊 GPD Fit:
  Threshold: 26.58
  Shape:     -0.1257
  Location:  0.00
  Scale:     0.82
  🎯 100-year Return Level (GPD): 30.59

==========================
📈 Analyzing variable: wind_speed_10m_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     -0.2789
  Location:  49.74
  Scale:     5.40
  🎯 100-year Return Level (GEV): 63.74

📊 GPD Fit:
  Threshold: 40.51
  Shape:     0.2245
  Location:  0.00
  Scale:     2.65
  🎯 100-year Return Level (GPD): 92.39

==========================
📈 Analyzing variable: et0_fao_evapotranspiration
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     -0.1624
  Location:  5.93
  Scale:     0.28
  🎯 100-year Return Level (GEV): 6.82

📊 GPD Fit:
  Threshold: 5.61
  Shape:     -0.1881
  Location:  0.00
  Scale:     0.38
  🎯 100-year Return Level (GPD): 7.13

==========================
📈 Analyzing variable: rain_sum
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     -0.3210
  Location:  34.74
  Scale:     18.52
  🎯 100-year Return Level (GEV): 79.26

📊 GPD Fit:
  Threshold: 8.40
  Shape:     0.3871
  Location:  0.00
  Scale:     5.36
  🎯 100-year Return Level (GPD): 247.29

==========================
📈 Analyzing variable: dew_point_2m_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.1250
  Location:  23.06
  Scale:     0.35
  🎯 100-year Return Level (GEV): 25.20

📊 GPD Fit:
  Threshold: 22.79
  Shape:     -0.2295
  Location:  0.01
  Scale:     0.42
  🎯 100-year Return Level (GPD): 24.28

==========================
📈 Analyzing variable: dew_point_2m_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.3433
  Location:  22.17
  Scale:     0.38
  🎯 100-year Return Level (GEV): 26.48

📊 GPD Fit:
  Threshold: 21.79
  Shape:     -0.1907
  Location:  0.01
  Scale:     0.33
  🎯 100-year Return Level (GPD): 23.12

==========================
📈 Analyzing variable: surface_pressure_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.3107
  Location:  985.25
  Scale:     0.70
  🎯 100-year Return Level (GEV): 992.42

📊 GPD Fit:
  Threshold: 983.99
  Shape:     -0.1512
  Location:  0.00
  Scale:     0.68
  🎯 100-year Return Level (GPD): 987.04

==========================
📈 Analyzing variable: surface_pressure_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.3398
  Location:  982.21
  Scale:     0.71
  🎯 100-year Return Level (GEV): 990.08

📊 GPD Fit:
  Threshold: 981.03
  Shape:     -0.1593
  Location:  0.00
  Scale:     0.65
  🎯 100-year Return Level (GPD): 983.90

==========================
📈 Analyzing variable: pressure_msl_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.3270
  Location:  1020.82
  Scale:     0.72
  🎯 100-year Return Level (GEV): 1028.54

📊 GPD Fit:
  Threshold: 1019.50
  Shape:     -0.0520
  Location:  0.10
  Scale:     0.59
  🎯 100-year Return Level (GPD): 1023.14

==========================
📈 Analyzing variable: pressure_msl_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.5000
  Location:  1017.71
  Scale:     0.80
  🎯 100-year Return Level (GEV): 1032.10

📊 GPD Fit:
  Threshold: 1016.40
  Shape:     -0.1092
  Location:  0.10
  Scale:     0.61
  🎯 100-year Return Level (GPD): 1019.48

==========================
📈 Analyzing variable: relative_humidity_2m_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.0249
  Location:  92.27
  Scale:     0.77
  🎯 100-year Return Level (GEV): 96.00

📊 GPD Fit:
  Threshold: 90.96
  Shape:     -0.0197
  Location:  0.00
  Scale:     0.55
  🎯 100-year Return Level (GPD): 94.78

==========================
📈 Analyzing variable: relative_humidity_2m_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.3537
  Location:  83.46
  Scale:     1.53
  🎯 100-year Return Level (GEV): 101.15

📊 GPD Fit:
  Threshold: 80.83
  Shape:     -0.1529
  Location:  0.00
  Scale:     1.34
  🎯 100-year Return Level (GPD): 86.82

==========================
📈 Analyzing variable: wet_bulb_temperature_2m_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.0392
  Location:  23.68
  Scale:     0.36
  🎯 100-year Return Level (GEV): 25.50

📊 GPD Fit:
  Threshold: 23.57
  Shape:     -0.2716
  Location:  0.00
  Scale:     0.45
  🎯 100-year Return Level (GPD): 25.01

==========================
📈 Analyzing variable: wet_bulb_temperature_2m_min
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     0.1621
  Location:  23.05
  Scale:     0.35
  🎯 100-year Return Level (GEV): 25.44

📊 GPD Fit:
  Threshold: 22.78
  Shape:     -0.1442
  Location:  0.00
  Scale:     0.33
  🎯 100-year Return Level (GPD): 24.29

==========================
📈 Analyzing variable: vapour_pressure_deficit_max
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     -0.4890
  Location:  1.32
  Scale:     0.11
  🎯 100-year Return Level (GEV): 1.53

📊 GPD Fit:
  Threshold: 1.42
  Shape:     -0.1940
  Location:  0.00
  Scale:     0.20
  🎯 100-year Return Level (GPD): 2.20

==========================
📈 Analyzing variable: soil_temperature_0_to_7cm_mean
==========================
No description has been provided for this image
No description has been provided for this image
📊 GEV Fit:
  Shape:     -0.7067
  Location:  26.99
  Scale:     0.69
  🎯 100-year Return Level (GEV): 27.93

📊 GPD Fit:
  Threshold: 29.88
  Shape:     -0.3008
  Location:  0.00
  Scale:     1.82
  🎯 100-year Return Level (GPD): 35.30

Outlier Detection¶

Examining the Data:

Outliers will be treated as extreme events rather than cases to disregard; there's not much depth in extensive climate variability research if extreme events are disregarded in data.

Remarks on the Outlier Detector Algorithms¶

Outliers are not things just to be casted away just because they don't fit ideal models. Outliers in this development are treated as "extreme" weather events in relation to climate variability. If weather states don't fall outside of the "ordinary", then climate change studies would have no value.

Outlier Detection by Local Outlier Factor Method¶

The Local Outlier Factor (LOF) is an unsupervised learning algorithm designed to identify anomalous data points within a dataset. It operates by comparing the local density of a data point to the density of its neighbors. It provides a convenient and efficient way to detect outliers in various applications.LOF is classified as an unsupervised learning algorithm because it does not require labeled data to identify outliers. Unlike supervised learning, which involves training a model on labeled examples to make predictions, LOF relies solely on the inherent structure and distribution of the data itself. This makes it particularly useful in scenarios where labeled data is scarce or unavailable.

For multivariate data sets The LOF algorithm does not analyse each attribute independently when determining if a row is an outlier. Instead, it considers all attributes (features) collectively and assesses how the entire row (i.e., the combination of all feature values) deviates from its neighboring rows in the multi-dimensional feature space. It's characteristics:

  1. Multi-Dimensional Analysis: LOF operates in a multi-dimensional space where each dimension corresponds to one attribute (e.g., temperature, relative humidity, pressure, etc.). It doesn't check individual attributes separately but looks at the combined values of all attributes for each row.

  2. Density-Based Approach: The algorithm calculates the local density of each data point by comparing it to its nearest neighbors. It then determines whether a point is an outlier based on the relative density compared to the densities of its neighbors. If the density of a point is significantly lower than that of its neighbors, the point is considered an outlier.

  3. Outlier Score for the Entire Row: LOF assigns an outlier score based on how the row's multi-dimensional position and density differ from those of nearby rows. It labels a row as an outlier only if its overall pattern (considering all attributes together) deviates significantly from its neighbors.

Mathematical Structure of LOF¶
  1. For dataset $X = {x_1,x_2,...,x_n}$ in a $r$-dimensional space, let $k$ be the number of nearest neighbors considered for the outlier detection.

  2. For each data point $x_i$, compute the distance to all other points. Such distance can be computed by applying any suitable metric (Eucluidean distance commonly used however):

$$d(x_i\,x_j) = \vert\,x_i\,-\,x_j\vert$$
  1. K-Nearest Neigbors(k-NN): For each point $x_i$ find its $k$ nearest neigbors $N_k(x_i)$ based on the distance calculated.

  2. Reachability Distance: The reachability distance $d_reach(x_i\,x_j)$ from point $x_i$ to point $x_j$ is defined as:

$$d_reach(x_i\,x_j) = \text{max}(d(x_i\,x_j)\,,kNNDist(x_j))$$

where $kNNDist(x_j)$ is the distance from $x_j$ to its $k$-th nearest neighbor, ensuring that the reachability distance accounts for the local density.

  1. Local Reachability Density (LRD): The LRD $\rho_{\text{L}}(x_i)$ for a point $x_i$ is computed as the inverse of the average reachability distance to its $k$ nearest neighbors:
$$\rho_{\text{L}}(x_i) = \frac{k}{\sum_{x_j \in N_k(x_i)} d_{reach}(x_i, x_j)}$$
  1. Local Outlier Factor: the LOF score, $LOF(x_i)$ for a point $x_i$ is defined as the ratio of the local reachability density of $x_i$ to the average local reachability density of its $k$ nearest neighbors:
$$LOF(x_i) = \frac{\sum_{x_j \in N_k(x_i)}\rho_{\text{L}}(x_j)}{k\cdot \rho_{\text{L}}(x_i)}$$

If $LOF(x_i) < 1$: $x_i$ is considered a normal point;

If $LOF(x_i) > 1$: $x_i$ is considered an outlier, where higher values suggest stronger outlier

Visualizations for Local Outlier Factor:

  1. Scatter Plot with Outliers Highlighted:

Visualizes the dataset and highlights the outliers identified by LOF.

  1. Decision Boundary Visualization:

Shows the regions where LOF considers points as outliers or inliers.

The following is a demonstration of LOF.

In [85]:
from sklearn.neighbors import LocalOutlierFactor

# Generate synthetic data
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.concatenate([X_inliers, X_outliers], axis=0)

# Fit the LOF model
lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X)
outlier_scores = -lof.negative_outlier_factor_

# 1. Scatter Plot with Outliers Highlighted
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', s=20, label='Inliers')
plt.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1], color='red', s=50, edgecolor='k', label='Outliers')
plt.title('Local Outlier Factor (LOF) - Outliers Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

# 2. Decision Boundary Visualization
xx, yy = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Manually compute the nearest neighbors for grid points
distances = np.linalg.norm(grid_points[:, np.newaxis] - X, axis=2)
# Get indices of the nearest neighbors
neighbors = np.argsort(distances, axis=1)[:, :lof.n_neighbors]

# Compute outlier scores for grid points
outlier_scores_grid = np.array([
    -np.mean(lof.negative_outlier_factor_[neighbors[i]]) for i in range(grid_points.shape[0])
])

# Reshape the scores for contour plotting
Z = outlier_scores_grid.reshape(xx.shape)

# Plotting the decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 7), cmap=plt.cm.Blues_r)
plt.colorbar()
plt.scatter(X[:, 0], X[:, 1], c='white', s=20, edgecolor='k')
plt.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1], color='red', s=50, edgecolor='k', label='Outliers')
plt.title('LOF Decision Boundary and Outliers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image

Explanation of Visuals:

  1. Scatter Plot with Outliers Highlighted:

This plot shows all data points, with blue points representing inliers and red points (larger markers) indicating the detected outliers based on LOF. The algorithm calculates the local density of each point and identifies those with significantly lower density compared to their neighbors as outliers.

  1. Decision Boundary Visualization:

The decision boundary plot shows the regions where LOF detects outliers. The background color gradient represents the level of LOF scores, with darker shades indicating areas with higher anomaly scores. Points labeled as outliers are plotted in red.

For data neighours a choice of 30-50 can help reduce the impact of seasonal intersections on outlier detection, allowing the model to differentiate between genuine outliers and typical seasonal variability.

The DataFrame generated will include a column 'LOF', which indicates outliers with -1, inliers with 1, and one can observe the outliers displayed separately beneath.

For data neighours a choice of 30-50 can help reduce the impact of seasonal intersections on outlier detection, allowing the model to differentiate between genuine outliers and typical seasonal variability.

Physical attributes in meteorology are generally coupled in weather dynamics, so outlier detection upon a sole attribute attribute may not have much meaning.

Binary Designation in Code¶

The LOF algorithm will be subjugated to a binary grouping (not representing actual LOF values, but grouping the range of LOF values into binary):

  1. A LOF value of 1 means that the observation is considered an inlier (not an outlier). It indicates that the data point is in a "normal" range compared to its neighbors.

  2. A LOF value of -1 indicates that the observation is considered an outlier. This means that this data point is significantly different from the rest of the data, based on its local density compared to its neighbors.

In [88]:
goody_frame.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16603 entries, 1980-01-08 04:00:00+00:00 to 2025-06-22 04:00:00+00:00
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   temperature_2m_mean             16603 non-null  float32
 1   temperature_2m_max              16603 non-null  float32
 2   temperature_2m_min              16603 non-null  float32
 3   apparent_temperature_mean       16603 non-null  float32
 4   apparent_temperature_max        16603 non-null  float32
 5   apparent_temperature_min        16603 non-null  float32
 6   wind_speed_10m_max              16603 non-null  float32
 7   et0_fao_evapotranspiration      16603 non-null  float32
 8   rain_sum                        16603 non-null  float32
 9   dew_point_2m_max                16603 non-null  float32
 10  dew_point_2m_min                16603 non-null  float32
 11  surface_pressure_max            16603 non-null  float32
 12  surface_pressure_min            16603 non-null  float32
 13  pressure_msl_max                16603 non-null  float32
 14  pressure_msl_min                16603 non-null  float32
 15  relative_humidity_2m_max        16603 non-null  float32
 16  relative_humidity_2m_min        16603 non-null  float32
 17  wet_bulb_temperature_2m_max     16603 non-null  float32
 18  wet_bulb_temperature_2m_min     16603 non-null  float32
 19  vapour_pressure_deficit_max     16603 non-null  float32
 20  soil_temperature_0_to_7cm_mean  16603 non-null  float32
dtypes: float32(21)
memory usage: 1.5 MB
In [89]:
from sklearn.neighbors import LocalOutlierFactor

# Select all numeric columns in the DataFrame
columns = goody_frame.select_dtypes(include='number').columns.tolist()

# Initialize the LOF model
lof = LocalOutlierFactor(n_neighbors=30)

# Fit the model and predict outliers
goody_frame['LOF'] = lof.fit_predict(goody_frame[columns])

# Extract the outliers
outliers = goody_frame[goody_frame['LOF'] == -1]

# Display the original DataFrame with LOF results
print("DataFrame with LOF results:")
print(goody_frame)

print("\nOutliers:")
print(outliers)

# Count how often outliers occur in each column (just for display purposes here, since LOF is multivariate)
outlier_frequencies = {}
for col in columns:
    outlier_frequencies[col] = (outliers[col].notna()).sum()

# Display the outlier frequencies
print("\nOutlier frequencies per column:")
for col, freq in outlier_frequencies.items():
    print(f"{col}: {freq} outliers")
DataFrame with LOF results:
                           temperature_2m_mean  temperature_2m_max  \
date                                                                 
1980-01-08 04:00:00+00:00            23.374834           24.141499   
1980-01-09 04:00:00+00:00            23.264421           23.891499   
1980-01-10 04:00:00+00:00            22.322748           23.191502   
1980-01-11 04:00:00+00:00            22.587332           23.341499   
1980-01-12 04:00:00+00:00            21.306086           22.091499   
...                                        ...                 ...   
2025-06-18 04:00:00+00:00            25.536499           26.148998   
2025-06-19 04:00:00+00:00            25.476080           26.049000   
2025-06-20 04:00:00+00:00            25.351082           26.199001   
2025-06-21 04:00:00+00:00            25.390665           25.898998   
2025-06-22 04:00:00+00:00            25.317749           25.898998   

                           temperature_2m_min  apparent_temperature_mean  \
date                                                                       
1980-01-08 04:00:00+00:00           22.191502                  22.092840   
1980-01-09 04:00:00+00:00           22.191502                  22.358231   
1980-01-10 04:00:00+00:00           21.341499                  21.067259   
1980-01-11 04:00:00+00:00           21.841499                  19.905577   
1980-01-12 04:00:00+00:00           20.541500                  19.145449   
...                                       ...                        ...   
2025-06-18 04:00:00+00:00           24.799000                  24.706778   
2025-06-19 04:00:00+00:00           24.749001                  24.506287   
2025-06-20 04:00:00+00:00           24.848999                  25.104864   
2025-06-21 04:00:00+00:00           24.699001                  25.419016   
2025-06-22 04:00:00+00:00           24.449001                  24.848602   

                           apparent_temperature_max  apparent_temperature_min  \
date                                                                            
1980-01-08 04:00:00+00:00                 23.520189                 20.983297   
1980-01-09 04:00:00+00:00                 23.697132                 21.602598   
1980-01-10 04:00:00+00:00                 22.371422                 19.988932   
1980-01-11 04:00:00+00:00                 20.436180                 18.984425   
1980-01-12 04:00:00+00:00                 19.637054                 18.262983   
...                                             ...                       ...   
2025-06-18 04:00:00+00:00                 26.624802                 23.568531   
2025-06-19 04:00:00+00:00                 25.319586                 23.481770   
2025-06-20 04:00:00+00:00                 27.231419                 23.766788   
2025-06-21 04:00:00+00:00                 27.573139                 24.278919   
2025-06-22 04:00:00+00:00                 26.219694                 23.004978   

                           wind_speed_10m_max  et0_fao_evapotranspiration  \
date                                                                        
1980-01-08 04:00:00+00:00           37.212578                    3.982460   
1980-01-09 04:00:00+00:00           36.896046                    3.946293   
1980-01-10 04:00:00+00:00           35.654541                    3.259691   
1980-01-11 04:00:00+00:00           42.072281                    4.604709   
1980-01-12 04:00:00+00:00           40.104061                    2.766571   
...                                       ...                         ...   
2025-06-18 04:00:00+00:00           41.411346                    5.478891   
2025-06-19 04:00:00+00:00           44.003281                    5.058496   
2025-06-20 04:00:00+00:00           40.882591                    4.981394   
2025-06-21 04:00:00+00:00           38.166790                    5.119689   
2025-06-22 04:00:00+00:00           44.039349                    5.130907   

                           rain_sum  dew_point_2m_max  ...  \
date                                                   ...   
1980-01-08 04:00:00+00:00       1.5         20.241501  ...   
1980-01-09 04:00:00+00:00       0.8         20.141499  ...   
1980-01-10 04:00:00+00:00       2.7         20.141499  ...   
1980-01-11 04:00:00+00:00       0.5         18.691502  ...   
1980-01-12 04:00:00+00:00       5.7         19.341499  ...   
...                             ...               ...  ...   
2025-06-18 04:00:00+00:00       0.1         21.949001  ...   
2025-06-19 04:00:00+00:00       0.3         22.299000  ...   
2025-06-20 04:00:00+00:00       0.1         22.449001  ...   
2025-06-21 04:00:00+00:00       0.0         22.049000  ...   
2025-06-22 04:00:00+00:00       1.0         22.148998  ...   

                           surface_pressure_min  pressure_msl_max  \
date                                                                
1980-01-08 04:00:00+00:00            980.577454       1019.299988   
1980-01-09 04:00:00+00:00            981.443359       1019.900024   
1980-01-10 04:00:00+00:00            980.805786       1019.599976   
1980-01-11 04:00:00+00:00            980.355164       1019.099976   
1980-01-12 04:00:00+00:00            978.976501       1017.799988   
...                                         ...               ...   
2025-06-18 04:00:00+00:00            981.819763       1019.000000   
2025-06-19 04:00:00+00:00            981.603394       1018.500000   
2025-06-20 04:00:00+00:00            981.255981       1018.700012   
2025-06-21 04:00:00+00:00            980.240479       1018.700012   
2025-06-22 04:00:00+00:00            979.411743       1017.500000   

                           pressure_msl_min  relative_humidity_2m_max  \
date                                                                    
1980-01-08 04:00:00+00:00       1016.099976                 87.652779   
1980-01-09 04:00:00+00:00       1016.900024                 87.906815   
1980-01-10 04:00:00+00:00       1016.299988                 90.619431   
1980-01-11 04:00:00+00:00       1015.900024                 81.800613   
1980-01-12 04:00:00+00:00       1014.599976                 89.427284   
...                                     ...                       ...   
2025-06-18 04:00:00+00:00       1017.000000                 83.175407   
2025-06-19 04:00:00+00:00       1016.799988                 83.486443   
2025-06-20 04:00:00+00:00       1016.500000                 86.541199   
2025-06-21 04:00:00+00:00       1015.400024                 85.219734   
2025-06-22 04:00:00+00:00       1014.500000                 86.767601   

                           relative_humidity_2m_min  \
date                                                  
1980-01-08 04:00:00+00:00                 70.725937   
1980-01-09 04:00:00+00:00                 73.156029   
1980-01-10 04:00:00+00:00                 71.578697   
1980-01-11 04:00:00+00:00                 61.149487   
1980-01-12 04:00:00+00:00                 78.321884   
...                                             ...   
2025-06-18 04:00:00+00:00                 69.789970   
2025-06-19 04:00:00+00:00                 72.850510   
2025-06-20 04:00:00+00:00                 70.866669   
2025-06-21 04:00:00+00:00                 72.591751   
2025-06-22 04:00:00+00:00                 72.591751   

                           wet_bulb_temperature_2m_max  \
date                                                     
1980-01-08 04:00:00+00:00                    21.027277   
1980-01-09 04:00:00+00:00                    20.914402   
1980-01-10 04:00:00+00:00                    20.636232   
1980-01-11 04:00:00+00:00                    19.724335   
1980-01-12 04:00:00+00:00                    19.959215   
...                                                ...   
2025-06-18 04:00:00+00:00                    22.869625   
2025-06-19 04:00:00+00:00                    23.097523   
2025-06-20 04:00:00+00:00                    23.118631   
2025-06-21 04:00:00+00:00                    22.751518   
2025-06-22 04:00:00+00:00                    22.906918   

                           wet_bulb_temperature_2m_min  \
date                                                     
1980-01-08 04:00:00+00:00                    20.169138   
1980-01-09 04:00:00+00:00                    20.337797   
1980-01-10 04:00:00+00:00                    18.998484   
1980-01-11 04:00:00+00:00                    17.843048   
1980-01-12 04:00:00+00:00                    19.202456   
...                                                ...   
2025-06-18 04:00:00+00:00                    21.824770   
2025-06-19 04:00:00+00:00                    22.261038   
2025-06-20 04:00:00+00:00                    21.683819   
2025-06-21 04:00:00+00:00                    22.099451   
2025-06-22 04:00:00+00:00                    21.904879   

                           vapour_pressure_deficit_max  \
date                                                     
1980-01-08 04:00:00+00:00                     0.880710   
1980-01-09 04:00:00+00:00                     0.795568   
1980-01-10 04:00:00+00:00                     0.783625   
1980-01-11 04:00:00+00:00                     1.107534   
1980-01-12 04:00:00+00:00                     0.576288   
...                                                ...   
2025-06-18 04:00:00+00:00                     1.023919   
2025-06-19 04:00:00+00:00                     0.914724   
2025-06-20 04:00:00+00:00                     0.984500   
2025-06-21 04:00:00+00:00                     0.912614   
2025-06-22 04:00:00+00:00                     0.912614   

                           soil_temperature_0_to_7cm_mean  LOF  
date                                                            
1980-01-08 04:00:00+00:00                       24.816500    1  
1980-01-09 04:00:00+00:00                       24.729010    1  
1980-01-10 04:00:00+00:00                       24.678999    1  
1980-01-11 04:00:00+00:00                       24.629000    1  
1980-01-12 04:00:00+00:00                       24.578997    1  
...                                                   ...  ...  
2025-06-18 04:00:00+00:00                       26.257332    1  
2025-06-19 04:00:00+00:00                       26.226084    1  
2025-06-20 04:00:00+00:00                       26.217749    1  
2025-06-21 04:00:00+00:00                       26.238586    1  
2025-06-22 04:00:00+00:00                       26.267754    1  

[16603 rows x 22 columns]

Outliers:
                           temperature_2m_mean  temperature_2m_max  \
date                                                                 
1984-11-09 04:00:00+00:00            23.899836           24.491501   
1984-11-10 04:00:00+00:00            23.629000           24.191502   
1984-12-16 04:00:00+00:00            22.676918           22.991501   
1984-12-17 04:00:00+00:00            22.058168           22.491501   
1985-03-06 04:00:00+00:00            21.545670           22.491501   
...                                        ...                 ...   
2024-02-10 04:00:00+00:00            22.682335           23.549000   
2024-06-24 04:00:00+00:00            26.503166           27.598999   
2024-07-08 04:00:00+00:00            25.876083           26.999001   
2024-07-13 04:00:00+00:00            26.430250           27.249001   
2025-04-05 04:00:00+00:00            22.586496           24.049000   

                           temperature_2m_min  apparent_temperature_mean  \
date                                                                       
1984-11-09 04:00:00+00:00           22.941502                  24.458384   
1984-11-10 04:00:00+00:00           22.641499                  23.820053   
1984-12-16 04:00:00+00:00           22.391499                  22.100792   
1984-12-17 04:00:00+00:00           21.491501                  21.047369   
1985-03-06 04:00:00+00:00           20.491501                  20.090147   
...                                       ...                        ...   
2024-02-10 04:00:00+00:00           21.449001                  21.389212   
2024-06-24 04:00:00+00:00           23.699001                  27.134459   
2024-07-08 04:00:00+00:00           24.848999                  26.651270   
2024-07-13 04:00:00+00:00           24.598999                  27.516558   
2025-04-05 04:00:00+00:00           21.598999                  20.925303   

                           apparent_temperature_max  apparent_temperature_min  \
date                                                                            
1984-11-09 04:00:00+00:00                 26.100151                 22.819016   
1984-11-10 04:00:00+00:00                 25.583670                 22.599895   
1984-12-16 04:00:00+00:00                 23.416800                 21.296741   
1984-12-17 04:00:00+00:00                 21.927549                 19.866741   
1985-03-06 04:00:00+00:00                 21.959633                 18.182652   
...                                             ...                       ...   
2024-02-10 04:00:00+00:00                 21.765350                 20.778839   
2024-06-24 04:00:00+00:00                 28.154533                 21.732914   
2024-07-08 04:00:00+00:00                 28.911919                 24.626991   
2024-07-13 04:00:00+00:00                 29.844009                 24.726059   
2025-04-05 04:00:00+00:00                 22.513348                 19.282921   

                           wind_speed_10m_max  et0_fao_evapotranspiration  \
date                                                                        
1984-11-09 04:00:00+00:00           33.466450                    3.518929   
1984-11-10 04:00:00+00:00           27.238943                    4.253847   
1984-12-16 04:00:00+00:00           27.067116                    3.873103   
1984-12-17 04:00:00+00:00           29.070974                    3.718415   
1985-03-06 04:00:00+00:00           36.721764                    3.856235   
...                                       ...                         ...   
2024-02-10 04:00:00+00:00           37.226505                    2.653172   
2024-06-24 04:00:00+00:00           49.184483                    3.909544   
2024-07-08 04:00:00+00:00           38.647640                    3.101306   
2024-07-13 04:00:00+00:00           47.428551                    4.825108   
2025-04-05 04:00:00+00:00           41.330288                    3.863715   

                           rain_sum  dew_point_2m_max  ...  \
date                                                   ...   
1984-11-09 04:00:00+00:00  5.599999         21.741501  ...   
1984-11-10 04:00:00+00:00  0.400000         19.641499  ...   
1984-12-16 04:00:00+00:00  0.400000         18.691502  ...   
1984-12-17 04:00:00+00:00  0.100000         18.241501  ...   
1985-03-06 04:00:00+00:00  5.800000         18.841499  ...   
...                             ...               ...  ...   
2024-02-10 04:00:00+00:00  5.600000         19.348999  ...   
2024-06-24 04:00:00+00:00  1.100000         23.549000  ...   
2024-07-08 04:00:00+00:00  5.600000         23.848999  ...   
2024-07-13 04:00:00+00:00  0.200000         23.949001  ...   
2025-04-05 04:00:00+00:00  9.000000         20.199001  ...   

                           surface_pressure_min  pressure_msl_max  \
date                                                                
1984-11-09 04:00:00+00:00            969.373352       1008.299988   
1984-11-10 04:00:00+00:00            971.259155       1009.400024   
1984-12-16 04:00:00+00:00            971.710815       1009.500000   
1984-12-17 04:00:00+00:00            971.748718       1011.200012   
1985-03-06 04:00:00+00:00            976.227295       1014.799988   
...                                         ...               ...   
2024-02-10 04:00:00+00:00            979.499207       1019.400024   
2024-06-24 04:00:00+00:00            978.435974       1016.900024   
2024-07-08 04:00:00+00:00            978.742493       1017.099976   
2024-07-13 04:00:00+00:00            980.770020       1018.200012   
2025-04-05 04:00:00+00:00            980.800781       1019.799988   

                           pressure_msl_min  relative_humidity_2m_max  \
date                                                                    
1984-11-09 04:00:00+00:00       1004.400024                 91.277199   
1984-11-10 04:00:00+00:00       1006.299988                 81.392090   
1984-12-16 04:00:00+00:00       1006.900024                 78.879066   
1984-12-17 04:00:00+00:00       1007.000000                 80.762741   
1985-03-06 04:00:00+00:00       1011.799988                 85.932579   
...                                     ...                       ...   
2024-02-10 04:00:00+00:00       1015.000000                 84.616859   
2024-06-24 04:00:00+00:00       1013.400024                 92.999802   
2024-07-08 04:00:00+00:00       1013.700012                 86.671631   
2024-07-13 04:00:00+00:00       1015.799988                 89.528587   
2025-04-05 04:00:00+00:00       1016.299988                 86.528656   

                           relative_humidity_2m_min  \
date                                                  
1984-11-09 04:00:00+00:00                 75.489159   
1984-11-10 04:00:00+00:00                 73.486008   
1984-12-16 04:00:00+00:00                 71.380264   
1984-12-17 04:00:00+00:00                 64.753418   
1985-03-06 04:00:00+00:00                 77.123642   
...                                             ...   
2024-02-10 04:00:00+00:00                 74.932289   
2024-06-24 04:00:00+00:00                 74.461945   
2024-07-08 04:00:00+00:00                 81.441521   
2024-07-13 04:00:00+00:00                 78.563728   
2025-04-05 04:00:00+00:00                 73.641228   

                           wet_bulb_temperature_2m_max  \
date                                                     
1984-11-09 04:00:00+00:00                    22.308134   
1984-11-10 04:00:00+00:00                    20.816519   
1984-12-16 04:00:00+00:00                    19.954479   
1984-12-17 04:00:00+00:00                    19.500244   
1985-03-06 04:00:00+00:00                    19.741110   
...                                                ...   
2024-02-10 04:00:00+00:00                    20.353031   
2024-06-24 04:00:00+00:00                    24.447775   
2024-07-08 04:00:00+00:00                    24.511969   
2024-07-13 04:00:00+00:00                    24.641516   
2025-04-05 04:00:00+00:00                    20.940470   

                           wet_bulb_temperature_2m_min  \
date                                                     
1984-11-09 04:00:00+00:00                    20.750200   
1984-11-10 04:00:00+00:00                    19.676367   
1984-12-16 04:00:00+00:00                    19.112356   
1984-12-17 04:00:00+00:00                    17.622261   
1985-03-06 04:00:00+00:00                    18.701254   
...                                                ...   
2024-02-10 04:00:00+00:00                    19.510832   
2024-06-24 04:00:00+00:00                    22.810537   
2024-07-08 04:00:00+00:00                    22.829180   
2024-07-13 04:00:00+00:00                    23.153746   
2025-04-05 04:00:00+00:00                    19.390827   

                           vapour_pressure_deficit_max  \
date                                                     
1984-11-09 04:00:00+00:00                     0.730756   
1984-11-10 04:00:00+00:00                     0.773741   
1984-12-16 04:00:00+00:00                     0.796286   
1984-12-17 04:00:00+00:00                     0.942908   
1985-03-06 04:00:00+00:00                     0.613719   
...                                                ...   
2024-02-10 04:00:00+00:00                     0.725367   
2024-06-24 04:00:00+00:00                     0.942223   
2024-07-08 04:00:00+00:00                     0.661041   
2024-07-13 04:00:00+00:00                     0.774324   
2025-04-05 04:00:00+00:00                     0.788585   

                           soil_temperature_0_to_7cm_mean  LOF  
date                                                            
1984-11-09 04:00:00+00:00                       24.966499   -1  
1984-11-10 04:00:00+00:00                       25.066500   -1  
1984-12-16 04:00:00+00:00                       24.279005   -1  
1984-12-17 04:00:00+00:00                       24.229006   -1  
1985-03-06 04:00:00+00:00                       22.903999   -1  
...                                                   ...  ...  
2024-02-10 04:00:00+00:00                       25.515665   -1  
2024-06-24 04:00:00+00:00                       27.853163   -1  
2024-07-08 04:00:00+00:00                       27.490671   -1  
2024-07-13 04:00:00+00:00                       27.588583   -1  
2025-04-05 04:00:00+00:00                       25.355246   -1  

[85 rows x 22 columns]

Outlier frequencies per column:
temperature_2m_mean: 85 outliers
temperature_2m_max: 85 outliers
temperature_2m_min: 85 outliers
apparent_temperature_mean: 85 outliers
apparent_temperature_max: 85 outliers
apparent_temperature_min: 85 outliers
wind_speed_10m_max: 85 outliers
et0_fao_evapotranspiration: 85 outliers
rain_sum: 85 outliers
dew_point_2m_max: 85 outliers
dew_point_2m_min: 85 outliers
surface_pressure_max: 85 outliers
surface_pressure_min: 85 outliers
pressure_msl_max: 85 outliers
pressure_msl_min: 85 outliers
relative_humidity_2m_max: 85 outliers
relative_humidity_2m_min: 85 outliers
wet_bulb_temperature_2m_max: 85 outliers
wet_bulb_temperature_2m_min: 85 outliers
vapour_pressure_deficit_max: 85 outliers
soil_temperature_0_to_7cm_mean: 85 outliers

Data Preservation: The first step involves creating a copy of the original dataframe, NYC_cntrprk_meteo_data, and storing it in a new dataframe named outlier_detect. This ensures that any modifications made to outlier_detect will not affect the original data.

LOF Model Application: The Local Outlier Factor (LOF) algorithm is then applied to outlier_detect. LOF is a statistical method used to identify outliers in a dataset by comparing the local density of a data point to the density of its neighbors. The results of the LOF analysis are stored in a new column named "LOF" within outlier_detect.

Outlier Flagging: The LOF column is used to flag outliers and inliers. Outliers are typically assigned a value of -1, while inliers are assigned a value of 1. This binary classification system simplifies the process of identifying and analyzing anomalous data points.

Outlier Frequency Calculation: The code calculates the frequency of outliers for each column in a specified list. This information can be valuable for understanding which columns are more prone to outliers and for determining appropriate outlier handling strategies.

Benefits of This Approach: By using outlier_detect as a working dataframe, the original structure of the dataset is preserved. This means that the original columns and their corresponding data types remain unchanged, ensuring that subsequent analyses and visualizations are based on the original data.

Normalizing the data to prevent differences in scale from affecting density calculations.

In [91]:
from scipy.stats import zscore

# Z-score normalization
goody_frame_normalized = goody_frame.copy()
goody_frame_normalized[columns] = goody_frame[columns].apply(zscore)

# Create a copy with a different name
goody_frame_zscore = goody_frame_normalized.copy()

print(goody_frame_zscore[columns])
                           temperature_2m_mean  temperature_2m_max  \
date                                                                 
1980-01-08 04:00:00+00:00            -0.784183           -0.660480   
1980-01-09 04:00:00+00:00            -0.880507           -0.844319   
1980-01-10 04:00:00+00:00            -1.702019           -1.359068   
1980-01-11 04:00:00+00:00            -1.471197           -1.248766   
1980-01-12 04:00:00+00:00            -2.588952           -2.167964   
...                                        ...                 ...   
2025-06-18 04:00:00+00:00             1.101646            0.815751   
2025-06-19 04:00:00+00:00             1.048937            0.742217   
2025-06-20 04:00:00+00:00             0.939889            0.852522   
2025-06-21 04:00:00+00:00             0.974421            0.631912   
2025-06-22 04:00:00+00:00             0.910809            0.631912   

                           temperature_2m_min  apparent_temperature_mean  \
date                                                                       
1980-01-08 04:00:00+00:00           -1.056235                  -1.186370   
1980-01-09 04:00:00+00:00           -1.056235                  -1.065703   
1980-01-10 04:00:00+00:00           -1.811791                  -1.652678   
1980-01-11 04:00:00+00:00           -1.367347                  -2.180867   
1980-01-12 04:00:00+00:00           -2.522900                  -2.526479   
...                                       ...                        ...   
2025-06-18 04:00:00+00:00            1.261536                   0.002126   
2025-06-19 04:00:00+00:00            1.217093                  -0.089033   
2025-06-20 04:00:00+00:00            1.305980                   0.183126   
2025-06-21 04:00:00+00:00            1.172649                   0.325964   
2025-06-22 04:00:00+00:00            0.950427                   0.066610   

                           apparent_temperature_max  apparent_temperature_min  \
date                                                                            
1980-01-08 04:00:00+00:00                 -1.182242                 -1.125872   
1980-01-09 04:00:00+00:00                 -1.112033                 -0.831592   
1980-01-10 04:00:00+00:00                 -1.638058                 -1.598375   
1980-01-11 04:00:00+00:00                 -2.405936                 -2.075698   
1980-01-12 04:00:00+00:00                 -2.723019                 -2.418513   
...                                             ...                       ...   
2025-06-18 04:00:00+00:00                  0.049628                  0.102582   
2025-06-19 04:00:00+00:00                 -0.468265                  0.061354   
2025-06-20 04:00:00+00:00                  0.290326                  0.196790   
2025-06-21 04:00:00+00:00                  0.425916                  0.440145   
2025-06-22 04:00:00+00:00                 -0.111113                 -0.165208   

                           wind_speed_10m_max  et0_fao_evapotranspiration  \
date                                                                        
1980-01-08 04:00:00+00:00            0.981832                   -0.657839   
1980-01-09 04:00:00+00:00            0.933995                   -0.706750   
1980-01-10 04:00:00+00:00            0.746366                   -1.635291   
1980-01-11 04:00:00+00:00            1.716280                    0.183672   
1980-01-12 04:00:00+00:00            1.418823                   -2.302173   
...                                       ...                         ...   
2025-06-18 04:00:00+00:00            1.616393                    1.365891   
2025-06-19 04:00:00+00:00            2.008112                    0.797360   
2025-06-20 04:00:00+00:00            1.536482                    0.693091   
2025-06-21 04:00:00+00:00            1.126042                    0.880116   
2025-06-22 04:00:00+00:00            2.013563                    0.895287   

                           rain_sum  dew_point_2m_max  ...  \
date                                                   ...   
1980-01-08 04:00:00+00:00 -0.121174         -0.434958  ...   
1980-01-09 04:00:00+00:00 -0.265892         -0.502522  ...   
1980-01-10 04:00:00+00:00  0.126914         -0.502522  ...   
1980-01-11 04:00:00+00:00 -0.327914         -1.482185  ...   
1980-01-12 04:00:00+00:00  0.747134         -1.043027  ...   
...                             ...               ...  ...   
2025-06-18 04:00:00+00:00 -0.410610          0.718683  ...   
2025-06-19 04:00:00+00:00 -0.369262          0.955153  ...   
2025-06-20 04:00:00+00:00 -0.410610          1.056498  ...   
2025-06-21 04:00:00+00:00 -0.431284          0.786245  ...   
2025-06-22 04:00:00+00:00 -0.224544          0.853807  ...   

                           surface_pressure_max  surface_pressure_min  \
date                                                                    
1980-01-08 04:00:00+00:00              1.417812              1.206291   
1980-01-09 04:00:00+00:00              1.747181              1.661176   
1980-01-10 04:00:00+00:00              1.482645              1.326241   
1980-01-11 04:00:00+00:00              1.296188              1.089516   
1980-01-12 04:00:00+00:00              0.481424              0.365267   
...                                         ...                   ...   
2025-06-18 04:00:00+00:00              1.371331              1.858911   
2025-06-19 04:00:00+00:00              1.142164              1.745246   
2025-06-20 04:00:00+00:00              1.260285              1.562741   
2025-06-21 04:00:00+00:00              1.171828              1.029269   
2025-06-22 04:00:00+00:00              0.611156              0.593911   

                           pressure_msl_max  pressure_msl_min  \
date                                                            
1980-01-08 04:00:00+00:00          1.446653          1.296101   
1980-01-09 04:00:00+00:00          1.758363          1.696707   
1980-01-10 04:00:00+00:00          1.602492          1.396253   
1980-01-11 04:00:00+00:00          1.342750          1.195980   
1980-01-12 04:00:00+00:00          0.667428          0.545011   
...                                     ...               ...   
2025-06-18 04:00:00+00:00          1.290815          1.746768   
2025-06-19 04:00:00+00:00          1.031073          1.646616   
2025-06-20 04:00:00+00:00          1.134976          1.496404   
2025-06-21 04:00:00+00:00          1.134976          0.945617   
2025-06-22 04:00:00+00:00          0.511589          0.494951   

                           relative_humidity_2m_max  relative_humidity_2m_min  \
date                                                                            
1980-01-08 04:00:00+00:00                  0.608421                 -0.313262   
1980-01-09 04:00:00+00:00                  0.659651                  0.043225   
1980-01-10 04:00:00+00:00                  1.206694                 -0.188165   
1980-01-11 04:00:00+00:00                 -0.571764                 -1.718097   
1980-01-12 04:00:00+00:00                  0.966278                  0.801040   
...                                             ...                       ...   
2025-06-18 04:00:00+00:00                 -0.294514                 -0.450565   
2025-06-19 04:00:00+00:00                 -0.231789                 -0.001594   
2025-06-20 04:00:00+00:00                  0.384252                 -0.292617   
2025-06-21 04:00:00+00:00                  0.117758                 -0.039553   
2025-06-22 04:00:00+00:00                  0.429910                 -0.039553   

                           wet_bulb_temperature_2m_max  \
date                                                     
1980-01-08 04:00:00+00:00                    -0.619876   
1980-01-09 04:00:00+00:00                    -0.707428   
1980-01-10 04:00:00+00:00                    -0.923191   
1980-01-11 04:00:00+00:00                    -1.630507   
1980-01-12 04:00:00+00:00                    -1.448321   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.809145   
2025-06-19 04:00:00+00:00                     0.985915   
2025-06-20 04:00:00+00:00                     1.002288   
2025-06-21 04:00:00+00:00                     0.717536   
2025-06-22 04:00:00+00:00                     0.838071   

                           wet_bulb_temperature_2m_min  \
date                                                     
1980-01-08 04:00:00+00:00                    -0.556888   
1980-01-09 04:00:00+00:00                    -0.436827   
1980-01-10 04:00:00+00:00                    -1.390226   
1980-01-11 04:00:00+00:00                    -2.212730   
1980-01-12 04:00:00+00:00                    -1.245027   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.621683   
2025-06-19 04:00:00+00:00                     0.932243   
2025-06-20 04:00:00+00:00                     0.521346   
2025-06-21 04:00:00+00:00                     0.817217   
2025-06-22 04:00:00+00:00                     0.678709   

                           vapour_pressure_deficit_max  \
date                                                     
1980-01-08 04:00:00+00:00                     0.067543   
1980-01-09 04:00:00+00:00                    -0.271330   
1980-01-10 04:00:00+00:00                    -0.318863   
1980-01-11 04:00:00+00:00                     0.970315   
1980-01-12 04:00:00+00:00                    -1.144075   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.637523   
2025-06-19 04:00:00+00:00                     0.202919   
2025-06-20 04:00:00+00:00                     0.480632   
2025-06-21 04:00:00+00:00                     0.194522   
2025-06-22 04:00:00+00:00                     0.194522   

                           soil_temperature_0_to_7cm_mean  
date                                                       
1980-01-08 04:00:00+00:00                       -0.691371  
1980-01-09 04:00:00+00:00                       -0.741405  
1980-01-10 04:00:00+00:00                       -0.770006  
1980-01-11 04:00:00+00:00                       -0.798600  
1980-01-12 04:00:00+00:00                       -0.827197  
...                                                   ...  
2025-06-18 04:00:00+00:00                        0.132629  
2025-06-19 04:00:00+00:00                        0.114758  
2025-06-20 04:00:00+00:00                        0.109992  
2025-06-21 04:00:00+00:00                        0.121908  
2025-06-22 04:00:00+00:00                        0.138589  

[16603 rows x 21 columns]
In [92]:
from sklearn.neighbors import LocalOutlierFactor

# Initialize the LOF model
lof = LocalOutlierFactor(n_neighbors=30)

# Create a copy of the original dataframe to avoid modifying it
# Fit the model and predict outliers on the selected columns
outlier_detect = goody_frame_zscore.copy()

# Add the LOF column with outlier predictions (-1 for outliers, 1 for inliers)
outlier_detect['LOF'] = lof.fit_predict(outlier_detect[columns])

# Extract the rows where outliers are identified (LOF == -1)
outliers = outlier_detect[outlier_detect['LOF'] == -1]

# Check LOF class counts
counts = outlier_detect['LOF'].value_counts()
print("LOF\n", counts)

# Display the original DataFrame with the new LOF column and the outliers
print("DataFrame with LOF results:")
print(outlier_detect)

print("\nOutliers:")
print(outliers)

# Count the number of outliers for each column
outlier_frequencies = {}
for col in columns:
    # Count how often outliers occur in each column
    outlier_frequencies[col] = (outliers[col].notna()).sum()

# Display the outlier frequencies
print("\nOutlier frequencies per column:")
for col, freq in outlier_frequencies.items():
    print(f"{col}: {freq} outliers")
LOF
 LOF
 1    16491
-1      112
Name: count, dtype: int64
DataFrame with LOF results:
                           temperature_2m_mean  temperature_2m_max  \
date                                                                 
1980-01-08 04:00:00+00:00            -0.784183           -0.660480   
1980-01-09 04:00:00+00:00            -0.880507           -0.844319   
1980-01-10 04:00:00+00:00            -1.702019           -1.359068   
1980-01-11 04:00:00+00:00            -1.471197           -1.248766   
1980-01-12 04:00:00+00:00            -2.588952           -2.167964   
...                                        ...                 ...   
2025-06-18 04:00:00+00:00             1.101646            0.815751   
2025-06-19 04:00:00+00:00             1.048937            0.742217   
2025-06-20 04:00:00+00:00             0.939889            0.852522   
2025-06-21 04:00:00+00:00             0.974421            0.631912   
2025-06-22 04:00:00+00:00             0.910809            0.631912   

                           temperature_2m_min  apparent_temperature_mean  \
date                                                                       
1980-01-08 04:00:00+00:00           -1.056235                  -1.186370   
1980-01-09 04:00:00+00:00           -1.056235                  -1.065703   
1980-01-10 04:00:00+00:00           -1.811791                  -1.652678   
1980-01-11 04:00:00+00:00           -1.367347                  -2.180867   
1980-01-12 04:00:00+00:00           -2.522900                  -2.526479   
...                                       ...                        ...   
2025-06-18 04:00:00+00:00            1.261536                   0.002126   
2025-06-19 04:00:00+00:00            1.217093                  -0.089033   
2025-06-20 04:00:00+00:00            1.305980                   0.183126   
2025-06-21 04:00:00+00:00            1.172649                   0.325964   
2025-06-22 04:00:00+00:00            0.950427                   0.066610   

                           apparent_temperature_max  apparent_temperature_min  \
date                                                                            
1980-01-08 04:00:00+00:00                 -1.182242                 -1.125872   
1980-01-09 04:00:00+00:00                 -1.112033                 -0.831592   
1980-01-10 04:00:00+00:00                 -1.638058                 -1.598375   
1980-01-11 04:00:00+00:00                 -2.405936                 -2.075698   
1980-01-12 04:00:00+00:00                 -2.723019                 -2.418513   
...                                             ...                       ...   
2025-06-18 04:00:00+00:00                  0.049628                  0.102582   
2025-06-19 04:00:00+00:00                 -0.468265                  0.061354   
2025-06-20 04:00:00+00:00                  0.290326                  0.196790   
2025-06-21 04:00:00+00:00                  0.425916                  0.440145   
2025-06-22 04:00:00+00:00                 -0.111113                 -0.165208   

                           wind_speed_10m_max  et0_fao_evapotranspiration  \
date                                                                        
1980-01-08 04:00:00+00:00            0.981832                   -0.657839   
1980-01-09 04:00:00+00:00            0.933995                   -0.706750   
1980-01-10 04:00:00+00:00            0.746366                   -1.635291   
1980-01-11 04:00:00+00:00            1.716280                    0.183672   
1980-01-12 04:00:00+00:00            1.418823                   -2.302173   
...                                       ...                         ...   
2025-06-18 04:00:00+00:00            1.616393                    1.365891   
2025-06-19 04:00:00+00:00            2.008112                    0.797360   
2025-06-20 04:00:00+00:00            1.536482                    0.693091   
2025-06-21 04:00:00+00:00            1.126042                    0.880116   
2025-06-22 04:00:00+00:00            2.013563                    0.895287   

                           rain_sum  dew_point_2m_max  ...  \
date                                                   ...   
1980-01-08 04:00:00+00:00 -0.121174         -0.434958  ...   
1980-01-09 04:00:00+00:00 -0.265892         -0.502522  ...   
1980-01-10 04:00:00+00:00  0.126914         -0.502522  ...   
1980-01-11 04:00:00+00:00 -0.327914         -1.482185  ...   
1980-01-12 04:00:00+00:00  0.747134         -1.043027  ...   
...                             ...               ...  ...   
2025-06-18 04:00:00+00:00 -0.410610          0.718683  ...   
2025-06-19 04:00:00+00:00 -0.369262          0.955153  ...   
2025-06-20 04:00:00+00:00 -0.410610          1.056498  ...   
2025-06-21 04:00:00+00:00 -0.431284          0.786245  ...   
2025-06-22 04:00:00+00:00 -0.224544          0.853807  ...   

                           surface_pressure_min  pressure_msl_max  \
date                                                                
1980-01-08 04:00:00+00:00              1.206291          1.446653   
1980-01-09 04:00:00+00:00              1.661176          1.758363   
1980-01-10 04:00:00+00:00              1.326241          1.602492   
1980-01-11 04:00:00+00:00              1.089516          1.342750   
1980-01-12 04:00:00+00:00              0.365267          0.667428   
...                                         ...               ...   
2025-06-18 04:00:00+00:00              1.858911          1.290815   
2025-06-19 04:00:00+00:00              1.745246          1.031073   
2025-06-20 04:00:00+00:00              1.562741          1.134976   
2025-06-21 04:00:00+00:00              1.029269          1.134976   
2025-06-22 04:00:00+00:00              0.593911          0.511589   

                           pressure_msl_min  relative_humidity_2m_max  \
date                                                                    
1980-01-08 04:00:00+00:00          1.296101                  0.608421   
1980-01-09 04:00:00+00:00          1.696707                  0.659651   
1980-01-10 04:00:00+00:00          1.396253                  1.206694   
1980-01-11 04:00:00+00:00          1.195980                 -0.571764   
1980-01-12 04:00:00+00:00          0.545011                  0.966278   
...                                     ...                       ...   
2025-06-18 04:00:00+00:00          1.746768                 -0.294514   
2025-06-19 04:00:00+00:00          1.646616                 -0.231789   
2025-06-20 04:00:00+00:00          1.496404                  0.384252   
2025-06-21 04:00:00+00:00          0.945617                  0.117758   
2025-06-22 04:00:00+00:00          0.494951                  0.429910   

                           relative_humidity_2m_min  \
date                                                  
1980-01-08 04:00:00+00:00                 -0.313262   
1980-01-09 04:00:00+00:00                  0.043225   
1980-01-10 04:00:00+00:00                 -0.188165   
1980-01-11 04:00:00+00:00                 -1.718097   
1980-01-12 04:00:00+00:00                  0.801040   
...                                             ...   
2025-06-18 04:00:00+00:00                 -0.450565   
2025-06-19 04:00:00+00:00                 -0.001594   
2025-06-20 04:00:00+00:00                 -0.292617   
2025-06-21 04:00:00+00:00                 -0.039553   
2025-06-22 04:00:00+00:00                 -0.039553   

                           wet_bulb_temperature_2m_max  \
date                                                     
1980-01-08 04:00:00+00:00                    -0.619876   
1980-01-09 04:00:00+00:00                    -0.707428   
1980-01-10 04:00:00+00:00                    -0.923191   
1980-01-11 04:00:00+00:00                    -1.630507   
1980-01-12 04:00:00+00:00                    -1.448321   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.809145   
2025-06-19 04:00:00+00:00                     0.985915   
2025-06-20 04:00:00+00:00                     1.002288   
2025-06-21 04:00:00+00:00                     0.717536   
2025-06-22 04:00:00+00:00                     0.838071   

                           wet_bulb_temperature_2m_min  \
date                                                     
1980-01-08 04:00:00+00:00                    -0.556888   
1980-01-09 04:00:00+00:00                    -0.436827   
1980-01-10 04:00:00+00:00                    -1.390226   
1980-01-11 04:00:00+00:00                    -2.212730   
1980-01-12 04:00:00+00:00                    -1.245027   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.621683   
2025-06-19 04:00:00+00:00                     0.932243   
2025-06-20 04:00:00+00:00                     0.521346   
2025-06-21 04:00:00+00:00                     0.817217   
2025-06-22 04:00:00+00:00                     0.678709   

                           vapour_pressure_deficit_max  \
date                                                     
1980-01-08 04:00:00+00:00                     0.067543   
1980-01-09 04:00:00+00:00                    -0.271330   
1980-01-10 04:00:00+00:00                    -0.318863   
1980-01-11 04:00:00+00:00                     0.970315   
1980-01-12 04:00:00+00:00                    -1.144075   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.637523   
2025-06-19 04:00:00+00:00                     0.202919   
2025-06-20 04:00:00+00:00                     0.480632   
2025-06-21 04:00:00+00:00                     0.194522   
2025-06-22 04:00:00+00:00                     0.194522   

                           soil_temperature_0_to_7cm_mean  LOF  
date                                                            
1980-01-08 04:00:00+00:00                       -0.691371    1  
1980-01-09 04:00:00+00:00                       -0.741405    1  
1980-01-10 04:00:00+00:00                       -0.770006    1  
1980-01-11 04:00:00+00:00                       -0.798600    1  
1980-01-12 04:00:00+00:00                       -0.827197    1  
...                                                   ...  ...  
2025-06-18 04:00:00+00:00                        0.132629    1  
2025-06-19 04:00:00+00:00                        0.114758    1  
2025-06-20 04:00:00+00:00                        0.109992    1  
2025-06-21 04:00:00+00:00                        0.121908    1  
2025-06-22 04:00:00+00:00                        0.138589    1  

[16603 rows x 22 columns]

Outliers:
                           temperature_2m_mean  temperature_2m_max  \
date                                                                 
1982-02-24 04:00:00+00:00            -2.180018           -1.947357   
1984-04-30 04:00:00+00:00            -2.038255           -1.579678   
1984-07-25 04:00:00+00:00            -0.297094           -0.219263   
1984-12-16 04:00:00+00:00            -1.393043           -1.506140   
1985-07-17 04:00:00+00:00            -0.295277           -0.072191   
...                                        ...                 ...   
2024-07-03 04:00:00+00:00             1.977677            1.955559   
2024-07-13 04:00:00+00:00             1.881351            1.624647   
2025-03-20 04:00:00+00:00             0.138375            0.117163   
2025-03-21 04:00:00+00:00             0.789039            1.073129   
2025-06-11 04:00:00+00:00             1.188884            0.889289   

                           temperature_2m_min  apparent_temperature_mean  \
date                                                                       
1982-02-24 04:00:00+00:00           -1.722900                  -2.621166   
1984-04-30 04:00:00+00:00           -2.389566                  -0.549201   
1984-07-25 04:00:00+00:00            0.010427                  -1.188954   
1984-12-16 04:00:00+00:00           -0.878460                  -1.182755   
1985-07-17 04:00:00+00:00           -1.500679                  -0.330040   
...                                       ...                        ...   
2024-07-03 04:00:00+00:00            1.128202                   0.302562   
2024-07-13 04:00:00+00:00            1.083758                   1.279666   
2025-03-20 04:00:00+00:00            0.194871                   1.457534   
2025-03-21 04:00:00+00:00            0.594871                   1.108831   
2025-06-11 04:00:00+00:00            1.128202                  -0.352042   

                           apparent_temperature_max  apparent_temperature_min  \
date                                                                            
1982-02-24 04:00:00+00:00                 -1.727251                 -2.895743   
1984-04-30 04:00:00+00:00                 -0.123619                 -0.571303   
1984-07-25 04:00:00+00:00                 -1.544697                 -1.016567   
1984-12-16 04:00:00+00:00                 -1.223266                 -0.976929   
1985-07-17 04:00:00+00:00                 -0.222561                 -1.134175   
...                                             ...                       ...   
2024-07-03 04:00:00+00:00                  0.192599                  0.069899   
2024-07-13 04:00:00+00:00                  1.326968                  0.652617   
2025-03-20 04:00:00+00:00                  1.346307                  1.364430   
2025-03-21 04:00:00+00:00                  1.452034                  1.049304   
2025-06-11 04:00:00+00:00                 -0.253851                 -0.359380   

                           wind_speed_10m_max  et0_fao_evapotranspiration  \
date                                                                        
1982-02-24 04:00:00+00:00            1.087166                    1.061252   
1984-04-30 04:00:00+00:00           -2.034551                   -1.022322   
1984-07-25 04:00:00+00:00            3.467249                   -1.611040   
1984-12-16 04:00:00+00:00           -0.551453                   -0.805730   
1985-07-17 04:00:00+00:00            0.725451                   -0.050931   
...                                       ...                         ...   
2024-07-03 04:00:00+00:00            2.594206                    1.301066   
2024-07-13 04:00:00+00:00            2.525774                    0.481734   
2025-03-20 04:00:00+00:00           -3.117747                    0.105352   
2025-03-21 04:00:00+00:00           -0.357679                    0.688249   
2025-06-11 04:00:00+00:00            3.021438                    1.828514   

                           rain_sum  dew_point_2m_max  ...  \
date                                                   ...   
1982-02-24 04:00:00+00:00 -0.348588         -0.603865  ...   
1984-04-30 04:00:00+00:00  1.553419         -0.705211  ...   
1984-07-25 04:00:00+00:00  2.731837          0.848740  ...   
1984-12-16 04:00:00+00:00 -0.348588         -1.482185  ...   
1985-07-17 04:00:00+00:00  1.987573          0.274455  ...   
...                             ...               ...  ...   
2024-07-03 04:00:00+00:00  0.561068          1.765911  ...   
2024-07-13 04:00:00+00:00 -0.389936          2.069945  ...   
2025-03-20 04:00:00+00:00 -0.224544          0.279522  ...   
2025-03-21 04:00:00+00:00 -0.389936          0.414648  ...   
2025-06-11 04:00:00+00:00 -0.224544          0.617337  ...   

                           surface_pressure_min  pressure_msl_max  \
date                                                                
1982-02-24 04:00:00+00:00             -0.288667          0.667428   
1984-04-30 04:00:00+00:00              0.334069          0.511589   
1984-07-25 04:00:00+00:00             -1.367635         -0.891024   
1984-12-16 04:00:00+00:00             -3.451599         -3.644282   
1985-07-17 04:00:00+00:00              0.699240          1.238879   
...                                         ...               ...   
2024-07-03 04:00:00+00:00             -0.464696         -0.527379   
2024-07-13 04:00:00+00:00              1.307451          0.875234   
2025-03-20 04:00:00+00:00              0.174329          0.563525   
2025-03-21 04:00:00+00:00              0.920702          1.186912   
2025-06-11 04:00:00+00:00              1.557514          1.498621   

                           pressure_msl_min  relative_humidity_2m_max  \
date                                                                    
1982-02-24 04:00:00+00:00         -0.155988                  0.926413   
1984-04-30 04:00:00+00:00          0.545011                  1.311157   
1984-07-25 04:00:00+00:00         -1.357745                  1.180015   
1984-12-16 04:00:00+00:00         -3.310561                 -1.160941   
1985-07-17 04:00:00+00:00          0.695253                  1.723833   
...                                     ...                       ...   
2024-07-03 04:00:00+00:00         -0.606654                  1.027962   
2024-07-13 04:00:00+00:00          1.145889                  0.986708   
2025-03-20 04:00:00+00:00          0.194527                 -0.680470   
2025-03-21 04:00:00+00:00          0.895526                  0.156970   
2025-06-11 04:00:00+00:00          1.446344                 -0.091745   

                           relative_humidity_2m_min  \
date                                                  
1982-02-24 04:00:00+00:00                 -2.597358   
1984-04-30 04:00:00+00:00                  0.710764   
1984-07-25 04:00:00+00:00                  1.176724   
1984-12-16 04:00:00+00:00                 -0.217274   
1985-07-17 04:00:00+00:00                  0.684803   
...                                             ...   
2024-07-03 04:00:00+00:00                 -0.477517   
2024-07-13 04:00:00+00:00                  0.836517   
2025-03-20 04:00:00+00:00                 -0.944173   
2025-03-21 04:00:00+00:00                 -1.055872   
2025-06-11 04:00:00+00:00                 -0.542302   

                           wet_bulb_temperature_2m_max  \
date                                                     
1982-02-24 04:00:00+00:00                    -0.937965   
1984-04-30 04:00:00+00:00                    -1.053084   
1984-07-25 04:00:00+00:00                     0.695513   
1984-12-16 04:00:00+00:00                    -1.451995   
1985-07-17 04:00:00+00:00                     0.228041   
...                                                ...   
2024-07-03 04:00:00+00:00                     1.859969   
2024-07-13 04:00:00+00:00                     2.183517   
2025-03-20 04:00:00+00:00                     0.337831   
2025-03-21 04:00:00+00:00                     0.543531   
2025-06-11 04:00:00+00:00                     0.619203   

                           wet_bulb_temperature_2m_min  \
date                                                     
1982-02-24 04:00:00+00:00                    -3.522482   
1984-04-30 04:00:00+00:00                    -0.993538   
1984-07-25 04:00:00+00:00                     0.698083   
1984-12-16 04:00:00+00:00                    -1.309165   
1985-07-17 04:00:00+00:00                    -0.228983   
...                                                ...   
2024-07-03 04:00:00+00:00                     1.333226   
2024-07-13 04:00:00+00:00                     1.567723   
2025-03-20 04:00:00+00:00                    -0.277850   
2025-03-21 04:00:00+00:00                     0.504016   
2025-06-11 04:00:00+00:00                     0.717358   

                           vapour_pressure_deficit_max  \
date                                                     
1982-02-24 04:00:00+00:00                     1.237140   
1984-04-30 04:00:00+00:00                    -0.961819   
1984-07-25 04:00:00+00:00                    -1.093742   
1984-12-16 04:00:00+00:00                    -0.268469   
1985-07-17 04:00:00+00:00                    -0.616006   
...                                                ...   
2024-07-03 04:00:00+00:00                     1.013464   
2024-07-13 04:00:00+00:00                    -0.355880   
2025-03-20 04:00:00+00:00                     0.844324   
2025-03-21 04:00:00+00:00                     1.290921   
2025-06-11 04:00:00+00:00                     0.746506   

                           soil_temperature_0_to_7cm_mean  LOF  
date                                                            
1982-02-24 04:00:00+00:00                       -1.270411   -1  
1984-04-30 04:00:00+00:00                       -0.734257   -1  
1984-07-25 04:00:00+00:00                       -0.137349   -1  
1984-12-16 04:00:00+00:00                       -0.998759   -1  
1985-07-17 04:00:00+00:00                       -0.348237   -1  
...                                                   ...  ...  
2024-07-03 04:00:00+00:00                        0.933276   -1  
2024-07-13 04:00:00+00:00                        0.893960   -1  
2025-03-20 04:00:00+00:00                        0.146927   -1  
2025-03-21 04:00:00+00:00                        0.156459   -1  
2025-06-11 04:00:00+00:00                        0.329218   -1  

[112 rows x 22 columns]

Outlier frequencies per column:
temperature_2m_mean: 112 outliers
temperature_2m_max: 112 outliers
temperature_2m_min: 112 outliers
apparent_temperature_mean: 112 outliers
apparent_temperature_max: 112 outliers
apparent_temperature_min: 112 outliers
wind_speed_10m_max: 112 outliers
et0_fao_evapotranspiration: 112 outliers
rain_sum: 112 outliers
dew_point_2m_max: 112 outliers
dew_point_2m_min: 112 outliers
surface_pressure_max: 112 outliers
surface_pressure_min: 112 outliers
pressure_msl_max: 112 outliers
pressure_msl_min: 112 outliers
relative_humidity_2m_max: 112 outliers
relative_humidity_2m_min: 112 outliers
wet_bulb_temperature_2m_max: 112 outliers
wet_bulb_temperature_2m_min: 112 outliers
vapour_pressure_deficit_max: 112 outliers
soil_temperature_0_to_7cm_mean: 112 outliers

The LOF algorithm offers a row-by-row analysis of the meteorological data to identify outliers. This approach involves assessing the local density of each data point relative to its neighbors.

LOF's Row-by-Row Analysis:

The LOF model operates on a row-by-row basis, examining each row of data independently. Each row in your dataset (e.g., a daily weather observation) is assessed individually; weather data recorded for a specific day at a particular location, having multiple attributes. It focuses on the values in the specified columns to assess the local density of each data point. By comparing the density of a row (data point) to its neighbors, the model can determine whether it's an outlier or an inlier.

Outlier Determination:

If a row's local density is significantly lower than its neighbors, LOF labels it as an outlier (-1). Conversely, rows with similar densities to their neighbors are labeled as inliers (1). This labeling is reflected in the LOF column of your outlier_detect DataFrame.

Temporal Context and Outlier Detection:

The columns you're analyzing (PRCP, SNOW, SNWD, TMIN, and TMAX) are associated with a Datetime column, indicating that the data points correspond to specific time points. This temporal context is essential to consider when interpreting outliers.

Seasonal Variations and Outlier Identification:

Weather data often exhibits seasonal patterns. Values that might be considered outliers in one season could be perfectly normal in another. LOF, while effective for general outlier detection, doesn't inherently account for these temporal relationships. It's highly likely that a 30 neighbor consideration will not misidentify.

Developing A Classification Model Based On LOF¶

The Local Outlier Factor (LOF) method provides a valuable tool for identifying anomalous data points within a dataset. By labeling outliers as -1 and inliers as 1, LOF effectively transforms the outlier detection problem into a binary classification task. This opens the door for the application of various classification algorithms, including logistic regression and support vector machine. One can assign the -1 instances as extreme events. Assuming the case of extreme events to have one day durations, extreme events would then fall into the following classes: extreme heat day, blizzard day, extreme cold day and day of flooding; the latter term depends on a terrain that encourages and sustains water elevation. Such extreme events identification is based on the attributes of PRCP, SNOW, SNWD, TMIN, TMAX in the dataset.

Logistic Regression, a popular statistical model, can be employed to predict the probability of a data point (event) belonging to a specific class (in this case, outlier or inlier to be identified with extreme events and normal weather). By using the LOF column as the target variable and selecting relevant features from the meteorological data, one can train logistic regression models to identify outliers based on the observed characteristics.To recall the logistic regression model is of the form:

$$P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k)}}$$

where:

$$P(Y = 1 | X)\,\,\text {is the probability of the outcome being 1}.$$$$X\,\, \text {represents the vector of predictor variables} (X_1, X_2, \ldots, X_k)$$$$\beta_0\,\,\text {is the intercept.}$$$$(\beta_1, \beta_2, \ldots, \beta_k)\,\,\text {are the coefficients for the predictor variables.}$$$$(e)\,\,\text {is the standard exponential value.}$$

The log-odds (logit) transformation of the model is given by:

$$\text{logit}(P) = \log\left(\frac{P(Y = 1 | X)}{1 - P(Y = 1 | X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$$

The performance of the logistic regression models can be evaluated using metrics such as accuracy, precision, recall, and the ROC-AUC score. These metrics provide insights into the model's ability to correctly classify outliers and inliers, as well as its sensitivity and specificity.

It's important to note that logistic regression offers a probabilistic approach to outlier detection, modeling the presence of outliers as a function of input features. This approach contrasts with Extreme Value Analysis (EVA), which focuses on the tail behavior of distributions.

Logistic regression can be sensitive to imbalanced classes. To create a balanced dataset in Python, you can employ several techniques, particularly if your data is imbalanced with respect to the target classes. One common approach is oversampling the minority class or undersampling the majority class. Will implement the latter.

Note: no feature selection to be done on the current "outlier_detect" dataframe because the attributes are significant or elementary by consensus concerning extreme weather.

In [95]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample
import pandas as pd

# Transform 'LOF' Column
outlier_detect_logit = outlier_detect.copy()
outlier_detect_logit['LOF'] = outlier_detect_logit['LOF'].replace({1: 0, -1: 1})

columns_features = outlier_detect_logit.drop(columns='LOF').columns

# Count the occurrences of each unique value in 'LOF'
counts = outlier_detect_logit['LOF'].value_counts()
print(counts)

# Separate features and target
X = outlier_detect_logit[columns_features]
y = outlier_detect_logit['LOF']

print("Overall class distribution:")
print(y.value_counts())

# If there are any instances of the minority class, proceed with the split
if (y == 1).sum() > 0:
    # Train-Test Split with stratification
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    print("\nClass distribution in training set:")
    print(y_train.value_counts())
    print("\nClass distribution in test set:")
    print(y_test.value_counts())

    # If the minority class (1) is present in the training set, no need to resample
    if (y_train == 1).sum() > 0:
        X_train_balanced = X_train
        y_train_balanced = y_train
    else:
        # Upsample the minority class before splitting
        print("\nThe minority class is not present in the training set after splitting.")
        print("Adjusting by upsampling the minority class before splitting...")

        df_majority = outlier_detect[outlier_detect['LOF'] == 0]
        df_minority = outlier_detect[outlier_detect['LOF'] == 1]
        df_minority_upsampled = resample(df_minority, 
                                         replace=True, 
                                         n_samples=df_majority.shape[0], 
                                         random_state=42)

        df_balanced = pd.concat([df_majority, df_minority_upsampled])

        X = df_balanced[columns_features]
        y = df_balanced['LOF']

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

        X_train_balanced = X_train
        y_train_balanced = y_train

        print("\nClass distribution in balanced training set:")
        print(y_train_balanced.value_counts())

    # Train the Logistic Regression Model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_balanced, y_train_balanced)

    # Validate the model
    y_pred = model.predict(X_test)

    # Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:\n", conf_matrix)

    # Classification Report
    report = classification_report(y_test, y_pred)
    print("\nClassification Report:\n", report)

    # Display coefficients and intercept
    print("\nCoefficients:\n", model.coef_)
    print("Intercept:\n", model.intercept_)

else:
    print("The dataset does not contain any instances of the minority class (1). Consider collecting more data or adjusting the dataset.")
LOF
0    16491
1      112
Name: count, dtype: int64
Overall class distribution:
LOF
0    16491
1      112
Name: count, dtype: int64

Class distribution in training set:
LOF
0    13192
1       90
Name: count, dtype: int64

Class distribution in test set:
LOF
0    3299
1      22
Name: count, dtype: int64

Confusion Matrix:
 [[3298    1]
 [  19    3]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00      3299
           1       0.75      0.14      0.23        22

    accuracy                           0.99      3321
   macro avg       0.87      0.57      0.61      3321
weighted avg       0.99      0.99      0.99      3321


Coefficients:
 [[-0.4142905  -1.32995903 -0.90096399  2.00923998  0.14820063 -1.12019417
   0.68060291 -0.46578716  0.01631078  0.89305918 -0.86632205 -0.0371282
  -0.38230594  0.55503656 -0.23326582 -0.20887113 -0.26927343  1.75853439
  -0.50782609  0.30522694  0.54741553]]
Intercept:
 [-6.08772099]

For the general LOF class distribution only about 0.67% of samples belong to the minority class (1). This imbalance has a significant impact on evaluation metrics. There is also high observed imbalance for both the training set and test set.

CONFUSION MATRIX:

True Negatives (TN): 3298

False Positives (FP): 1

False Negatives (FN): 19

True Positives (TP): 3

Hence, the model misses most of the rare class (high false negative rate), which is expected due to imbalance.

CLASSIFICATION REPORT:

Precision (class 1): 75% → When it predicts 1, it's correct 75% of the time.

Recall (class 1): 14% → It captures only 14% of the actual positives, which is quite poor.

F1-score (class 1): 0.23 → Harmonic mean of precision and recall, very low due to poor recall.

Macro avg: Average across classes, treating them equally.

Weighted avg: Takes class imbalance into account.

ACCURACY: 0.99 This is misleading in imbalanced problems. Predicting everything as class 0 would already give you ~99% accuracy.

COEFFICIENTS:

Features with larger magnitude coefficients have more influence.

Positive coefficients increase the log-odds of class 1.

E.g., feature #4 with weight 2.009 and feature #18 with 1.758 contribute most positively toward classifying as class 1.

The intercept of -6.09 implies a low base probability of class 1 before feature effects are added.

SUMMARY -- The model is accurate, but not useful for minority class detection:

Only 3 out of 22 actual class-1 samples were detected.

It may be heavily biased toward predicting class 0.

Observing the disproportionate orientation (whether general sample, training set or test set), SMOTE or undersample or other things likely will not be much helpful. The model generally isn't reliable.

Identifying the Rate of Outliers (Extreme Events) Based On LOF¶

Identifying Outliers Over Time¶

Visualizing the temporal distribution of outliers provides valuable insights into the underlying patterns and trends in a dataset. By examining how the number of outliers changes over time, you can identify seasonal variations, anomalies, or broader trends that may be indicative of underlying factors or changes in the data-generating process.

Key Observations

  1. Seasonal Patterns: If the number of outliers consistently increases or decreases during specific months or seasons, it suggests a seasonal influence on the data. This could be due to factors such as weather patterns, economic cycles, or human behavior.

  2. Anomalies: Outliers that deviate significantly from the expected seasonal patterns or overall trend might indicate unusual events or data errors. These anomalies could be further investigated to understand their underlying causes.

  3. Trends: Observing increasing or decreasing trends in the number of outliers over time can reveal broader changes in the data-generating process. This might be indicative of shifts in underlying conditions, such as changes in climate, economic conditions, or technological advancements.

For the island of Montserrat will now observe outlier counts in the months of July and September. The two most influential weather months for Montserrat are July and September. July marks the start of the rainy season, which continues through November, and September is the wettest month. While Montserrat experiences warm, tropical weather year-round, these months significantly impact rainfall patterns and potential for hurricanes. To be accomplished with a trend model based on the "Prophet" algorithm.

In [98]:
from prophet import Prophet

outlier_detect.info()
outlier_detect_reset = outlier_detect.reset_index()
print(outlier_detect_reset.columns)

# Ensure 'date' column exists and is datetime
outlier_detect_reset['date'] = pd.to_datetime(outlier_detect_reset['date'])

# Extract month and year if not already done
outlier_detect_reset['month'] = outlier_detect_reset['date'].dt.month
outlier_detect_reset['year'] = outlier_detect_reset['date'].dt.year


# Filter for outliers in January and July only
july_sep_outliers = outlier_detect_reset[(goody_frame_zscore['LOF'] == 1) & (outlier_detect_reset['month'].isin([7, 9]))]

# Check if there are any rows in the filtered dataset
if july_sep_outliers.empty:
    print("No outliers found for January and July. Please check the dataset or filtering criteria.")
else:
    # Group by year and month to count outliers
    outlier_counts = july_sep_outliers.groupby(['year', 'month']).size().reset_index(name='outlier_count')

    # Create a 'date' column for Prophet, using the 'year' and 'month'
    outlier_counts['date'] = pd.to_datetime(outlier_counts[['year', 'month']].assign(day=1))

    # Prepare the data for Prophet (requires 'ds' and 'y' column names)
    prophet_data = outlier_counts[['date', 'outlier_count']]
    prophet_data.columns = ['ds', 'y']

    # Check if prophet_data has at least two non-NaN rows
    if prophet_data.dropna().shape[0] < 2:
        print("The dataset has less than 2 non-NaN rows. Not enough data for Prophet model.")
    else:
        # Define and fit the Prophet model
        model = Prophet()
        model.fit(prophet_data)

        # Create a DataFrame for future predictions (e.g., 24 months ahead to include future January and July)
        future_dates = model.make_future_dataframe(periods=24, freq='ME')

        # Forecast
        forecast = model.predict(future_dates)

        # Filter the forecasted data for only January and July
        forecast_july_sep = forecast[(forecast['ds'].dt.month.isin([1, 7]))]

        # Merge the actual and forecasted data for visualization
        merged_data = pd.concat([prophet_data, forecast_july_sep[['ds', 'yhat']].rename(columns={'yhat': 'outlier_count'})])

        plt.figure(figsize=(10, 6))  # Adjust the figure size to be wider and shorter
        plt.plot(prophet_data['ds'], prophet_data['y'], marker='o', linestyle='-', label='Actual Outlier Count')
        plt.plot(forecast_july_sep['ds'], forecast_july_sep['yhat'], marker='o', linestyle='--',
                 label='Forecasted Outlier Count', color='orange')
        plt.xlabel('Date')
        plt.ylabel('Outlier Count')
        plt.title('Outlier Counts for January and July Over the Years with Forecast')
        plt.legend()
        plt.grid(True)  # Optional: Adds grid lines for better readability
        plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16603 entries, 1980-01-08 04:00:00+00:00 to 2025-06-22 04:00:00+00:00
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   temperature_2m_mean             16603 non-null  float32
 1   temperature_2m_max              16603 non-null  float32
 2   temperature_2m_min              16603 non-null  float32
 3   apparent_temperature_mean       16603 non-null  float32
 4   apparent_temperature_max        16603 non-null  float32
 5   apparent_temperature_min        16603 non-null  float32
 6   wind_speed_10m_max              16603 non-null  float32
 7   et0_fao_evapotranspiration      16603 non-null  float32
 8   rain_sum                        16603 non-null  float32
 9   dew_point_2m_max                16603 non-null  float32
 10  dew_point_2m_min                16603 non-null  float32
 11  surface_pressure_max            16603 non-null  float32
 12  surface_pressure_min            16603 non-null  float32
 13  pressure_msl_max                16603 non-null  float32
 14  pressure_msl_min                16603 non-null  float32
 15  relative_humidity_2m_max        16603 non-null  float32
 16  relative_humidity_2m_min        16603 non-null  float32
 17  wet_bulb_temperature_2m_max     16603 non-null  float32
 18  wet_bulb_temperature_2m_min     16603 non-null  float32
 19  vapour_pressure_deficit_max     16603 non-null  float32
 20  soil_temperature_0_to_7cm_mean  16603 non-null  float32
 21  LOF                             16603 non-null  int32  
dtypes: float32(21), int32(1)
memory usage: 1.5 MB
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
       'temperature_2m_min', 'apparent_temperature_mean',
       'apparent_temperature_max', 'apparent_temperature_min',
       'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
       'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
       'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
       'relative_humidity_2m_max', 'relative_humidity_2m_min',
       'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
       'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean', 'LOF'],
      dtype='object')
No outliers found for January and July. Please check the dataset or filtering criteria.
C:\Users\verlene\AppData\Local\Temp\ipykernel_10952\891376500.py:16: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  july_sep_outliers = outlier_detect_reset[(goody_frame_zscore['LOF'] == 1) & (outlier_detect_reset['month'].isin([7, 9]))]

No outliers have been detected, hence will pursue an alternative outlier detector in the near future concerning outlier counts. Recalling from the LOF development, the outlier count was quite poor, hence such should be expected.

Outlier Detection Based Histogram¶

Histogram-Based Outlier Score (HBOS) is an efficient, univariate outlier detection method that works by analyzing the distribution of each feature (attribute) independently. Unlike algorithms like LOF that consider all attributes together in a multi-dimensional space, HBOS looks at each attribute separately, making it computationally efficient and suitable for high-dimensional datasets.

Mathematical Structure for steps in the HBOS Process¶

1. Construct the Histogram:

Given a dataset $X = {x_1,x_2,...,x_n}$, chhose a number of bins $k$ and compute the histogram:

$$H = {h_1,h_2,...,h_k}$$

where $h_j$ represents the count (or frequency) of data points falling into the $j$-th bin.

2. Calculate Bin Width:

The width of each bin $w$ can be computed as:

$$w = \frac{\text{max}(X) - \text{min}(X)}{k}$$

3. Estimate Probability Density:

The probability density function (PDF) for each bin:

$$p_j = \frac{h_j}{n\cdot\,w}$$

where $p_j$ is the probability density of the $j$-th bin.

4. Outlier Score Calculation:

For each data point $x_i$, determine the bin $b_i$ it fall into. The outlier score $O(x_i)$ for that point can be identified as:

$$O(x_i) = \frac{1}{p_{b_i}}\,\,\,\,\,\,\text{if}\,\,\,p_{b_i}\,\,>\,\,0$$

If $p_{b_i}=0$, namely, the bin is empty, assign a high score:

$$O(x_i) = \infty$$

5. Normalization (optional):

Normalize the outlier scores to a designed range (say, 0 to 1) for interpretation:

$$O_{\text{norm}}(x_i)=\frac{O(x_i) - O_{\text{min}}}{O_{\text{max}}-O_{\text{min}}}$$

where $O_{\text{min}}$ and $O_{\text{max}}$ are the minimum and maximum outlier scores across all data points.

6. Interpretation:

  1. High Outlier Score indicates that the data point is rare or unusual compared to the distribution of the rest of the data.

  2. Low Outlier Score: conveys that the data point is common and falls within the expected range of the data distribution.

The characteristics of HBOS:¶

Univariate Analysis:

HBOS treats each feature independently, meaning it evaluates the distribution of each attribute (e.g., temperature, humidity) separately rather than considering the interaction between multiple attributes.

Histogram Construction:

  1. For each feature, HBOS constructs a histogram to approximate its distribution. The dataset is divided into a number of bins (intervals), and the frequency (or density) of data points in each bin is calculated.

  2. The bins might be equidistant (fixed width) or variable in width, depending on the distribution of the data.

Calculating Outlier Scores:

  1. After constructing histograms for each feature, HBOS assigns an outlier score for each value in the dataset based on the inverse density of the bin in which the value falls.

  2. If a value falls into a bin with low density (fewer data points), it receives a higher outlier score, indicating it is an anomaly for that particular feature.

  3. Conversely, if the value falls into a bin with high density, it gets a lower score, indicating it is more common.

Aggregating Scores Across Features:

  1. Since HBOS evaluates each feature separately, it aggregates the outlier scores from all features to determine the final outlier score for each row.

  2. Aggregation can be done in various ways, but a common approach is to multiply the outlier scores of each feature. This approach assumes independence between features, which may not always be true but keeps the method simple and efficient.

Contamination:

  1. In the context of the HBOS (and other anomaly detection models), contamination is a parameter that represents the expected proportion of outliers in the dataset. It is a way for the algorithm to understand how many data points it should consider as outliers.

  2. The parameter guides the model on how many data points should be classified as outliers. It helps the algorithm determine a threshold for the outlier scores to classify points as either outliers or inliers. When contamination = 0.05, it means the model assumes that 5% of the data are outliers.

  3. The contamination is always a decimal; it's not the outlier (1) - inlier (0) classification.

The value of contamination resides from 0 to 1. A value like 0.05 (5%) indicates that you expect around 5% of your dataset to be anomalous.

A value of 0 (contamination score, not class) doesn't make sense in practice.

A value of 1 (contamination score, not class) would mean all points are outliers (also impractical).

  1. The Outlier Score, different to the contanimation has the general rule of thumb:

An outlier score from 0 and 0.5 conveys an inlier. Anything above such range to be dubbed an outlier; higher values beyond 0.5 indicate stronger devations from the distribution.

Steps in the Process¶

  1. Data Preparation

Will be similar to what was done with LOF.

  1. Normalize the Data

Normalize only the relevant numerical columns that will be used for HBOS (e.g., temperature, humidity, pressure, wind speed, etc.).

  1. HBOS Algorithm Implementation

Create a binary classification column called HBOS_class based on the results.

  1. Logistic Regression

Using the HBOS_class as the target, perform logistic regression to predict the probability of a row (or weather state) being classified as an outlier.

Demonstration with HBOS¶

To now demonstrate HBOS using histograms with conveyance of how contamination (the expected proportion of outliers) affects the results.

In [102]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from pyod.models.hbos import HBOS
from pyod.utils.data import generate_data
from pyod.utils.utility import standardizer

# Step 1: Generate Sample Data (replace this with your dataset if needed)
X_train, X_test, y_train, y_test = generate_data(
    n_train=200, n_test=100, n_features=2, contamination=0.1, random_state=42
)

# Step 2: Standardize the data
X_train_norm, X_test_norm = standardizer(X_train, X_test)

# Step 3: Initialize HBOS with a contamination level (i.e., the expected fraction of outliers)
contamination = 0.1  # 10% expected outliers
hbos = HBOS(contamination=contamination)

# Step 4: Train the HBOS model
hbos.fit(X_train_norm)

# Step 5: Get Outlier Scores for the test data
y_test_scores = hbos.decision_function(X_test_norm)  # higher scores indicate more abnormal

# Step 6: Predict outliers
y_test_pred = hbos.predict(X_test_norm)  # 1 indicates an outlier, 0 indicates an inlier

# Step 7: Visualize the histograms for both features

fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Histogram for the first feature
axs[0].hist(X_train[:, 0], bins=20, color='lightblue', edgecolor='black')
axs[0].set_title('Histogram of Feature 1')
axs[0].set_xlabel('Feature 1 Values')
axs[0].set_ylabel('Frequency')

# Histogram for the second feature
axs[1].hist(X_train[:, 1], bins=20, color='lightgreen', edgecolor='black')
axs[1].set_title('Histogram of Feature 2')
axs[1].set_xlabel('Feature 2 Values')
axs[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Step 8: Visualize the HBOS results on a scatter plot, showing outliers

plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test_pred, cmap='coolwarm', marker='o', edgecolor='k')
plt.title('HBOS: Outlier Detection (Red = Outlier, Blue = Inlier)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Outlier/ Inlier')
plt.show()

print("Outlier Scores:", y_test_scores)
No description has been provided for this image
No description has been provided for this image
Outlier Scores: [1.636894   1.636894   3.25528788 1.636894   1.636894   0.35447232
 0.64240801 1.34895831 0.35447232 0.35447232 1.34895831 2.21636348
 0.35447232 2.21636348 2.21636348 1.34895831 2.26080189 0.35447232
 1.9284278  0.35447232 1.636894   1.636894   0.64240801 3.34840259
 0.35447232 2.26080189 2.21636348 1.9284278  0.35447232 0.35447232
 2.26080189 0.64240801 3.25528788 0.35447232 1.9284278  0.35447232
 0.35447232 1.34895831 0.64240801 3.2600372  2.21636348 2.21636348
 0.64240801 0.64240801 0.35447232 1.34895831 0.35447232 0.35447232
 3.83475736 0.64240801 0.64240801 0.64240801 1.9284278  1.34895831
 0.35447232 0.35447232 0.64240801 1.9284278  0.64240801 0.35447232
 1.636894   0.64240801 0.64240801 1.34895831 3.2600372  2.21636348
 2.21636348 2.26080189 3.83475736 1.636894   0.35447232 0.64240801
 0.35447232 3.2600372  0.35447232 1.9284278  0.35447232 3.25528788
 0.35447232 0.64240801 1.9284278  0.35447232 0.35447232 0.64240801
 4.25452319 3.83475736 0.64240801 2.21636348 0.35447232 0.35447232
 3.60424531 3.16060346 3.60424531 6.03202404 5.96603178 5.96603178
 5.86659804 6.35359539 3.97468511 6.13145778]

Now, to proceed with the Montserrat based data.

In [104]:
from sklearn.preprocessing import StandardScaler
from pyod.models.hbos import HBOS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


HBOS_data = goody_frame.copy()
# Exclude columns for normalization
columns_to_exclude_normalization = ['date']

#Exclude columns from HBOS subjugation
columns_to_exclude_hbos = ['date']

# Get the columns to normalize and apply HBOS
columns_for_hbos = [col for col in HBOS_data.columns if col not in columns_to_exclude_hbos]
columns_for_normalization = [col for col in HBOS_data.columns if col not in columns_to_exclude_normalization]

# Normalize relevant columns
scaler = StandardScaler()
HBOS_data[columns_for_normalization] = scaler.fit_transform(HBOS_data[columns_for_normalization])

# Select the data for HBOS
X_hbos = HBOS_data[columns_for_hbos]

# Apply the HBOS algorithm
hbos = HBOS(contamination=0.05)
hbos.fit(X_hbos)

# Add the HBOS classification column
HBOS_data['HBOS_class'] = hbos.labels_  # 0 for inliers, 1 for outliers

# Show dataframe 
print(HBOS_data)

# Step 5: Logistic Regression
# Define features (excluding 'HBOS_class') and the target ('HBOS_class')
X = HBOS_data[columns_for_hbos]
y = HBOS_data['HBOS_class']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
                           temperature_2m_mean  temperature_2m_max  \
date                                                                 
1980-01-08 04:00:00+00:00            -0.784182           -0.660480   
1980-01-09 04:00:00+00:00            -0.880505           -0.844319   
1980-01-10 04:00:00+00:00            -1.702018           -1.359068   
1980-01-11 04:00:00+00:00            -1.471196           -1.248766   
1980-01-12 04:00:00+00:00            -2.588951           -2.167963   
...                                        ...                 ...   
2025-06-18 04:00:00+00:00             1.101648            0.815752   
2025-06-19 04:00:00+00:00             1.048938            0.742217   
2025-06-20 04:00:00+00:00             0.939890            0.852522   
2025-06-21 04:00:00+00:00             0.974423            0.631912   
2025-06-22 04:00:00+00:00             0.910811            0.631912   

                           temperature_2m_min  apparent_temperature_mean  \
date                                                                       
1980-01-08 04:00:00+00:00           -1.056235                  -1.186370   
1980-01-09 04:00:00+00:00           -1.056235                  -1.065703   
1980-01-10 04:00:00+00:00           -1.811791                  -1.652677   
1980-01-11 04:00:00+00:00           -1.367347                  -2.180867   
1980-01-12 04:00:00+00:00           -2.522900                  -2.526479   
...                                       ...                        ...   
2025-06-18 04:00:00+00:00            1.261537                   0.002126   
2025-06-19 04:00:00+00:00            1.217093                  -0.089033   
2025-06-20 04:00:00+00:00            1.305981                   0.183126   
2025-06-21 04:00:00+00:00            1.172650                   0.325964   
2025-06-22 04:00:00+00:00            0.950428                   0.066610   

                           apparent_temperature_max  apparent_temperature_min  \
date                                                                            
1980-01-08 04:00:00+00:00                 -1.182242                 -1.125873   
1980-01-09 04:00:00+00:00                 -1.112033                 -0.831593   
1980-01-10 04:00:00+00:00                 -1.638058                 -1.598376   
1980-01-11 04:00:00+00:00                 -2.405937                 -2.075699   
1980-01-12 04:00:00+00:00                 -2.723019                 -2.418514   
...                                             ...                       ...   
2025-06-18 04:00:00+00:00                  0.049628                  0.102581   
2025-06-19 04:00:00+00:00                 -0.468265                  0.061353   
2025-06-20 04:00:00+00:00                  0.290325                  0.196789   
2025-06-21 04:00:00+00:00                  0.425916                  0.440144   
2025-06-22 04:00:00+00:00                 -0.111114                 -0.165209   

                           wind_speed_10m_max  et0_fao_evapotranspiration  \
date                                                                        
1980-01-08 04:00:00+00:00            0.981832                   -0.657839   
1980-01-09 04:00:00+00:00            0.933995                   -0.706751   
1980-01-10 04:00:00+00:00            0.746366                   -1.635292   
1980-01-11 04:00:00+00:00            1.716280                    0.183671   
1980-01-12 04:00:00+00:00            1.418822                   -2.302173   
...                                       ...                         ...   
2025-06-18 04:00:00+00:00            1.616393                    1.365890   
2025-06-19 04:00:00+00:00            2.008112                    0.797359   
2025-06-20 04:00:00+00:00            1.536482                    0.693090   
2025-06-21 04:00:00+00:00            1.126042                    0.880115   
2025-06-22 04:00:00+00:00            2.013563                    0.895286   

                           rain_sum  dew_point_2m_max  ...  pressure_msl_max  \
date                                                   ...                     
1980-01-08 04:00:00+00:00 -0.121174         -0.434959  ...          1.446654   
1980-01-09 04:00:00+00:00 -0.265892         -0.502523  ...          1.758364   
1980-01-10 04:00:00+00:00  0.126914         -0.502523  ...          1.602493   
1980-01-11 04:00:00+00:00 -0.327914         -1.482186  ...          1.342751   
1980-01-12 04:00:00+00:00  0.747134         -1.043028  ...          0.667429   
...                             ...               ...  ...               ...   
2025-06-18 04:00:00+00:00 -0.410610          0.718682  ...          1.290816   
2025-06-19 04:00:00+00:00 -0.369262          0.955152  ...          1.031074   
2025-06-20 04:00:00+00:00 -0.410610          1.056497  ...          1.134977   
2025-06-21 04:00:00+00:00 -0.431284          0.786244  ...          1.134977   
2025-06-22 04:00:00+00:00 -0.224544          0.853806  ...          0.511590   

                           pressure_msl_min  relative_humidity_2m_max  \
date                                                                    
1980-01-08 04:00:00+00:00          1.296111                  0.608420   
1980-01-09 04:00:00+00:00          1.696717                  0.659651   
1980-01-10 04:00:00+00:00          1.396262                  1.206694   
1980-01-11 04:00:00+00:00          1.195990                 -0.571764   
1980-01-12 04:00:00+00:00          0.545021                  0.966278   
...                                     ...                       ...   
2025-06-18 04:00:00+00:00          1.746777                 -0.294515   
2025-06-19 04:00:00+00:00          1.646626                 -0.231789   
2025-06-20 04:00:00+00:00          1.496414                  0.384252   
2025-06-21 04:00:00+00:00          0.945627                  0.117757   
2025-06-22 04:00:00+00:00          0.494960                  0.429910   

                           relative_humidity_2m_min  \
date                                                  
1980-01-08 04:00:00+00:00                 -0.313262   
1980-01-09 04:00:00+00:00                  0.043225   
1980-01-10 04:00:00+00:00                 -0.188164   
1980-01-11 04:00:00+00:00                 -1.718097   
1980-01-12 04:00:00+00:00                  0.801040   
...                                             ...   
2025-06-18 04:00:00+00:00                 -0.450565   
2025-06-19 04:00:00+00:00                 -0.001594   
2025-06-20 04:00:00+00:00                 -0.292617   
2025-06-21 04:00:00+00:00                 -0.039553   
2025-06-22 04:00:00+00:00                 -0.039553   

                           wet_bulb_temperature_2m_max  \
date                                                     
1980-01-08 04:00:00+00:00                    -0.619877   
1980-01-09 04:00:00+00:00                    -0.707429   
1980-01-10 04:00:00+00:00                    -0.923192   
1980-01-11 04:00:00+00:00                    -1.630508   
1980-01-12 04:00:00+00:00                    -1.448322   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.809145   
2025-06-19 04:00:00+00:00                     0.985914   
2025-06-20 04:00:00+00:00                     1.002287   
2025-06-21 04:00:00+00:00                     0.717535   
2025-06-22 04:00:00+00:00                     0.838071   

                           wet_bulb_temperature_2m_min  \
date                                                     
1980-01-08 04:00:00+00:00                    -0.556889   
1980-01-09 04:00:00+00:00                    -0.436828   
1980-01-10 04:00:00+00:00                    -1.390226   
1980-01-11 04:00:00+00:00                    -2.212730   
1980-01-12 04:00:00+00:00                    -1.245028   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.621683   
2025-06-19 04:00:00+00:00                     0.932243   
2025-06-20 04:00:00+00:00                     0.521346   
2025-06-21 04:00:00+00:00                     0.817216   
2025-06-22 04:00:00+00:00                     0.678709   

                           vapour_pressure_deficit_max  \
date                                                     
1980-01-08 04:00:00+00:00                     0.067544   
1980-01-09 04:00:00+00:00                    -0.271330   
1980-01-10 04:00:00+00:00                    -0.318863   
1980-01-11 04:00:00+00:00                     0.970315   
1980-01-12 04:00:00+00:00                    -1.144075   
...                                                ...   
2025-06-18 04:00:00+00:00                     0.637523   
2025-06-19 04:00:00+00:00                     0.202919   
2025-06-20 04:00:00+00:00                     0.480632   
2025-06-21 04:00:00+00:00                     0.194522   
2025-06-22 04:00:00+00:00                     0.194522   

                           soil_temperature_0_to_7cm_mean       LOF  \
date                                                                  
1980-01-08 04:00:00+00:00                       -0.691370  0.071735   
1980-01-09 04:00:00+00:00                       -0.741405  0.071735   
1980-01-10 04:00:00+00:00                       -0.770006  0.071735   
1980-01-11 04:00:00+00:00                       -0.798600  0.071735   
1980-01-12 04:00:00+00:00                       -0.827196  0.071735   
...                                                   ...       ...   
2025-06-18 04:00:00+00:00                        0.132629  0.071735   
2025-06-19 04:00:00+00:00                        0.114759  0.071735   
2025-06-20 04:00:00+00:00                        0.109992  0.071735   
2025-06-21 04:00:00+00:00                        0.121909  0.071735   
2025-06-22 04:00:00+00:00                        0.138589  0.071735   

                           HBOS_class  
date                                   
1980-01-08 04:00:00+00:00           0  
1980-01-09 04:00:00+00:00           0  
1980-01-10 04:00:00+00:00           0  
1980-01-11 04:00:00+00:00           1  
1980-01-12 04:00:00+00:00           0  
...                               ...  
2025-06-18 04:00:00+00:00           0  
2025-06-19 04:00:00+00:00           0  
2025-06-20 04:00:00+00:00           0  
2025-06-21 04:00:00+00:00           0  
2025-06-22 04:00:00+00:00           0  

[16603 rows x 23 columns]
In [105]:
# Count the occurrences of each unique value in 'LOF'
counts = HBOS_data['HBOS_class'].value_counts()
print(counts)
HBOS_class
0    15772
1      831
Name: count, dtype: int64

Classification (Logit) Model Based On HBOS¶

In [107]:
# Create and fit the logistic regression model
log_reg = LogisticRegression(max_iter = 1000)
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))

# Check the coefficients of the logistic regression model
coefficients = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': log_reg.coef_[0]
})

print(coefficients.sort_values(by='Coefficient', ascending=False))
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      3141
           1       0.70      0.19      0.30       180

    accuracy                           0.95      3321
   macro avg       0.83      0.59      0.64      3321
weighted avg       0.94      0.95      0.94      3321

                           Feature  Coefficient
3        apparent_temperature_mean     1.791999
18     wet_bulb_temperature_2m_min     1.206207
6               wind_speed_10m_max     1.141004
13                pressure_msl_max     0.795647
5         apparent_temperature_min     0.392854
20  soil_temperature_0_to_7cm_mean     0.384362
8                         rain_sum     0.213222
7       et0_fao_evapotranspiration     0.140622
17     wet_bulb_temperature_2m_max     0.044277
4         apparent_temperature_max    -0.137123
21                             LOF    -0.195141
14                pressure_msl_min    -0.243705
19     vapour_pressure_deficit_max    -0.274131
9                 dew_point_2m_max    -0.284780
11            surface_pressure_max    -0.338216
15        relative_humidity_2m_max    -0.359617
12            surface_pressure_min    -0.428306
2               temperature_2m_min    -0.505661
0              temperature_2m_mean    -0.557331
1               temperature_2m_max    -0.915928
16        relative_humidity_2m_min    -1.116497
10                dew_point_2m_min    -1.441105

INTERPRETATION OF THE RESULTS:

Accuracy = 0.95: 95% of all predictions are correct.

Precision for class 1 (0.70): When the model predicts class 1, it's correct 70% of the time.

Recall for class 1 (0.19): The model only identifies 19% of the actual class 1 cases.

F1-score for class 1 (0.30): Low — suggests poor balance between precision and recall.

High precision but low recall for class 1: Model is conservative in predicting class 1 — when it does predict it, it's often right, but it misses most actual instances (false negatives are high).

This is a common issue in imbalanced classification problems.

Positive coefficients → increase likelihood of class 1

Negative coefficients → decrease likelihood of class 1

Higher mean apparent temperature, wet bulb temp, and wind speed are associated with increased probability of class 1. These might indicate weather stress, influencing your event target.

Lower dew point, humidity, and temperature max/mean decrease the likelihood of class 1. Possibly indicating that cooler, drier conditions are linked with class 0 (non-events).

Attempt to balance the data set:

In [111]:
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

# Apply SMOTE to oversample the minority class
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Train the logistic regression model again
log_reg_balanced = LogisticRegression(max_iter=1000, random_state=42)
log_reg_balanced.fit(X_train_balanced, y_train_balanced)

# Evaluate the model
y_pred_balanced = log_reg_balanced.predict(X_test)
print(classification_report(y_test, y_pred_balanced))

# Check the coefficients of the logistic regression model
coefficients = pd.DataFrame({
    'Feature': X_train_balanced.columns,
    'Coefficient': log_reg_balanced.coef_[0]
})

print(coefficients.sort_values(by='Coefficient', ascending=False))
              precision    recall  f1-score   support

           0       0.98      0.83      0.90      3141
           1       0.21      0.78      0.33       180

    accuracy                           0.83      3321
   macro avg       0.60      0.80      0.62      3321
weighted avg       0.94      0.83      0.87      3321

                           Feature  Coefficient
3        apparent_temperature_mean     3.077397
13                pressure_msl_max     2.507190
6               wind_speed_10m_max     1.933673
18     wet_bulb_temperature_2m_min     1.754571
5         apparent_temperature_min     1.170021
19     vapour_pressure_deficit_max     0.749079
20  soil_temperature_0_to_7cm_mean     0.559690
8                         rain_sum     0.309812
9                 dew_point_2m_max     0.279644
7       et0_fao_evapotranspiration     0.084072
17     wet_bulb_temperature_2m_max    -0.011749
21                             LOF    -0.289692
12            surface_pressure_min    -0.425241
16        relative_humidity_2m_min    -0.514414
4         apparent_temperature_max    -0.589974
14                pressure_msl_min    -0.759746
15        relative_humidity_2m_max    -0.770981
2               temperature_2m_min    -0.951287
1               temperature_2m_max    -1.346989
11            surface_pressure_max    -1.492041
0              temperature_2m_mean    -1.499307
10                dew_point_2m_min    -2.166019

INTERPRETATION --

Overall Performance: Accuracy: 0.83 — 83% of predictions are correct.

Macro Average F1: 0.62 — average F1 across both classes, treating them equally.

Weighted Average F1: 0.87 — average F1 weighted by class frequency (heavily influenced by class 0).

Recall for class 1 jumped from 0.19 → 0.78 ❗

Precision for class 1 dropped from 0.70 → 0.21

Accuracy dropped from 0.95 → 0.83

CLASS 1 (event/rare class): High recall (0.78): The model now catches most actual class 1 cases (few false negatives).

Low precision (0.21): But many of the predicted class 1s are wrong (many false positives).

F1-score (0.33): Modest — model is better at detecting events, but at the cost of many false alarms.

CLASS 0: High precision (0.98) and decent recall (0.83) — it still performs well, but not as perfectly as before.

ACHIEVEMENT: A recall-optimized model that detects more of the rare class (class 1).

Useful if the application values sensitivity over specificity.

Trade-offs: Higher false positives (lower precision) → might need post-processing or human review on class 1s.

Accuracy fell (because you’re now calling many more samples as class 1), but this is expected in imbalanced problems when optimizing recall.

CONSIDERATIONS: Support Vector Machine (classification) or ensemble methods (Random Forest or XGBoost) generally yield better performance, but at the cost of not have a explicity (analytical) model like the logit case; latter often desired by mathematical "calligraphy" enthusiasts.

Outlier Counts With Recent Data¶

To now observe oultier counters in the months of January and July based on the HBOS and "Prophet" algorithms:

In [114]:
HBOS_data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16603 entries, 1980-01-08 04:00:00+00:00 to 2025-06-22 04:00:00+00:00
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   temperature_2m_mean             16603 non-null  float64
 1   temperature_2m_max              16603 non-null  float64
 2   temperature_2m_min              16603 non-null  float64
 3   apparent_temperature_mean       16603 non-null  float64
 4   apparent_temperature_max        16603 non-null  float64
 5   apparent_temperature_min        16603 non-null  float64
 6   wind_speed_10m_max              16603 non-null  float64
 7   et0_fao_evapotranspiration      16603 non-null  float64
 8   rain_sum                        16603 non-null  float64
 9   dew_point_2m_max                16603 non-null  float64
 10  dew_point_2m_min                16603 non-null  float64
 11  surface_pressure_max            16603 non-null  float64
 12  surface_pressure_min            16603 non-null  float64
 13  pressure_msl_max                16603 non-null  float64
 14  pressure_msl_min                16603 non-null  float64
 15  relative_humidity_2m_max        16603 non-null  float64
 16  relative_humidity_2m_min        16603 non-null  float64
 17  wet_bulb_temperature_2m_max     16603 non-null  float64
 18  wet_bulb_temperature_2m_min     16603 non-null  float64
 19  vapour_pressure_deficit_max     16603 non-null  float64
 20  soil_temperature_0_to_7cm_mean  16603 non-null  float64
 21  LOF                             16603 non-null  float64
 22  HBOS_class                      16603 non-null  int32  
dtypes: float64(22), int32(1)
memory usage: 3.0 MB
In [115]:
HBOS_data_reset = HBOS_data.reset_index()
print(HBOS_data_reset.columns)
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
       'temperature_2m_min', 'apparent_temperature_mean',
       'apparent_temperature_max', 'apparent_temperature_min',
       'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
       'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
       'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
       'relative_humidity_2m_max', 'relative_humidity_2m_min',
       'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
       'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean', 'LOF',
       'HBOS_class'],
      dtype='object')
In [116]:
from prophet import Prophet
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Ensure 'date' column is datetime
HBOS_data_reset['date'] = pd.to_datetime(HBOS_data_reset['date'])

# Extract year and month from the 'date' column
HBOS_data_reset['year'] = HBOS_data_reset['date'].dt.year
HBOS_data_reset['month'] = HBOS_data_reset['date'].dt.month

# Filter for HBOS outliers in January and July only
jan_july_HBOS_outliers = HBOS_data_reset[
    (HBOS_data_reset['HBOS_class'] == 1) &
    (HBOS_data_reset['month'].isin([1, 7]))
]

# Check if there are any rows in the filtered dataset
if jan_july_HBOS_outliers.empty:
    print("No outliers found for January and July. Please check the dataset or filtering criteria.")
else:
    # Group by year and month to count outliers
    outlier_counts = jan_july_HBOS_outliers.groupby(['year', 'month']).size().reset_index(name='outlier_count')

    # Create a 'date' column for Prophet (first day of each month)
    outlier_counts['date'] = pd.to_datetime(outlier_counts[['year', 'month']].assign(day=1))

    # Prepare data for Prophet: 'ds' and 'y'
    prophet_data = outlier_counts[['date', 'outlier_count']].rename(columns={'date': 'ds', 'outlier_count': 'y'})

    # Check for sufficient data
    if prophet_data.dropna().shape[0] < 2:
        print("The dataset has less than 2 non-NaN rows. Not enough data for Prophet model.")
    else:
        # Fit Prophet model
        model = Prophet()
        model.fit(prophet_data)

        # Forecast 24 months into the future
        future_dates = model.make_future_dataframe(periods=24, freq='MS')  # 'MS' = Month Start
        forecast = model.predict(future_dates)

        # Clip negative values and round forecasts to nearest integer
        forecast['yhat'] = forecast['yhat'].clip(lower=0).round()
        forecast['yhat_lower'] = forecast['yhat_lower'].clip(lower=0).round()
        forecast['yhat_upper'] = forecast['yhat_upper'].clip(lower=0).round()

        # Filter forecast for January and July
        forecast_jan_july = forecast[forecast['ds'].dt.month.isin([1, 7])]

        # Plot actual and forecasted data
        plt.figure(figsize=(12, 6))
        plt.plot(prophet_data['ds'], prophet_data['y'], marker='o', linestyle='-', label='Actual Outlier Count')
        plt.plot(forecast_jan_july['ds'], forecast_jan_july['yhat'], marker='o', linestyle='--',
                 color='orange', label='Forecasted Outlier Count')
        plt.fill_between(forecast_jan_july['ds'],
                         forecast_jan_july['yhat_lower'],
                         forecast_jan_july['yhat_upper'],
                         color='orange', alpha=0.3, label='Forecast CI')
        plt.xlabel('Date')
        plt.ylabel('Outlier Count')
        plt.title('HBOS Outlier Counts for January & July with 2-Year Forecast')
        plt.xticks(rotation=45)
        plt.legend()
        plt.grid(True)
        plt.tight_layout()
        plt.show()
23:08:37 - cmdstanpy - INFO - Chain [1] start processing
23:08:37 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image

Hourly Meteorological Data¶

Hourly meteorological climate data provides detailed information about weather conditions at specific intervals throughout the day. This type of data is essential for understanding short-term weather patterns, tracking changes in atmospheric conditions, and supporting various applications such as weather forecasting, climate modeling, and environmental research.

Key components of hourly meteorological data typically include:

  1. Temperature: Measures the air temperature at a specific height.

  2. Humidity: Indicates the amount of water vapor in the air.

  3. Pressure: Measures the atmospheric pressure.

  4. Wind speed and direction: Describes the movement of air.

  5. Precipitation: Records the amount and type of precipitation (e.g., rain, snow, hail).

  6. Solar radiation: Measures the amount of solar energy reaching the Earth's surface.

  7. Cloud cover: Indicates the percentage of the sky covered by clouds.

Hourly data is collected from various sources, including:

  1. Weather stations: Ground-based stations equipped with sensors to measure different meteorological parameters.

  2. Satellites: Remote sensing satellites that provide observations of the Earth's atmosphere and surface.

  3. Aircraft: Equipped with instruments to collect data during flights.

Applications of hourly meteorological data:

  1. Weather forecasting: Provides the basis for short-term and local weather forecasts.

  2. Environmental research: Supports studies on air quality, water resources, and ecological processes.

  3. Agriculture: Assists farmers in making decisions about planting, irrigation, and harvesting.

  4. Energy: Helps manage energy demand and supply based on weather conditions.

  5. Transportation: Aids in planning and operations of transportation systems, especially those affected by weather (e.g., aviation, shipping).

Challenges and Considerations:

  1. Data quality: Ensuring the accuracy and reliability of hourly data is crucial for its applications.

  2. Data availability: Not all locations have access to comprehensive hourly data, especially in remote or developing regions.

  3. Data assimilation: Combining data from different sources and using data assimilation techniques to improve the quality and consistency of the data.

  4. The model becomes more complex, requiring more computational power and advanced techniques to manage larger datasets and higher-frequency noise.

  5. Overfitting Risk: High-resolution data might lead to overfitting, especially if the true patterns are smoother or less variable over time.

  6. More Noise: Hourly data can introduce more noise due to short-term fluctuations or measurement errors, which may not be as important for long-term forecasting.

Hourly data provides more detail and can capture short-term fluctuations in weather conditions, such as temperature changes or wind speed variations that occur throughout the day.

If the objective is to make predictions for the next few hours, hourly data is more suitable.

Many meteorological variables (e.g., temperature, humidity) have strong diurnal patterns that can only be captured with high-frequency data. Such types of variables also capture atmospheric physics and chemistry aspects.

With hourly data, you typically have more data points, which can improve model training if the data is clean and well-processed.

Of consequence, use of daily meteorological data will be directed towards supervised learning and ensemble learning methods.

Hourly Meteorological Attributes¶

Variable Description

  1. temperature_2m - Instant °C (°F): Air temperature at 2 meters above ground.

  2. relative_humidity_2m - Instant (%): Relative humidity at 2 meters above ground.

  3. dew_point_2m - Instant °C (°F): Dew point temperature at 2 meters above ground.

  4. apparent_temperature - Instant °C (°F): Apparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation.

  5. pressure_msl and surface_pressure - Instant (hPa): Atmospheric air pressure reduced to mean sea level (msl), and pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.

  6. rain - Preceding hour sum mm (inch): Only liquid precipitation of the preceding hour including local showers and rain from large scale systems.

  7. shortwave_radiation- Preceding hour mean (W/m²): Shortwave solar radiation as average of the preceding hour. This is equal to the total global horizontal irradiation.

  8. direct_radiation and direct_normal_irradiance - Preceding hour mean (W/m²): Direct solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun).

  9. diffuse_radiation - Preceding hour mean (W/m²): Diffuse solar radiation as average of the preceding hour.

  10. direct_normal_irradiance_instant (W/m²): any given instant refers to the amount of solar radiation received per unit area by a surface that is held perpendicular to the sun's rays at that specific moment.

  11. terrestrial_radiation_instant (W/m²): this usually refers to the instantaneous terrestrial (longwave) radiation at the Earth’s surface. It represents the longwave radiation emitted by the Earth’s surface at a specific instant in time.

Terrestrial radiation is part of the Earth’s surface energy balance — it’s the infrared energy emitted by the surface as it cools.

  1. wind_speed_10m and wind_speed_100m - Instant km/h (mph, m/s, knots): Wind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.

  2. et0_fao_evapotranspiration - Preceding hour sum mm (inch):ET₀ Reference Evapotranspiration of a well watered grass field. Based on FAO-56 Penman-Monteith equations ET₀ is calculated from temperature, wind speed, humidity and solar radiation. Unlimited soil water is assumed. ET₀ is commonly used to estimate the required irrigation for plants.

  3. vapour_pressure_deficit - Instant (kPa): Vapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases.

  4. {soil_temperature_0_to_7cm; soil_temperature_7_to_28cm; soil_temperature_28_to_100cm; soil_temperature_100_to_255cm } - Instant °C (°F):Average temperature of different soil levels below ground.

  5. {soil_moisture_0_to_7cm; soil_moisture_7_to_28cm; soil_moisture_28_to_100cm; soil_moisture_100_to_255cm} - Instant (m³/m³): Average soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.

  6. total_column_integrated_water_vapour - represents the total amount of water vapor in a vertical column of the atmosphere, typically expressed in kg/m² or mm

  7. boundary_layer_height - the depth of the lowest part of the atmosphere that is directly influenced by the Earth's surface. This layer is characterized by turbulence and the exchange of heat, moisture, and momentum between the surface and the atmosphere. Its height varies depending on factors like time of day, season, and surface conditions, but it typically ranges from a few hundred meters to a few kilometers.

In [118]:
import openmeteo_requests

import pandas as pd
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
	"latitude": 16.7425,
	"longitude": -62.1874,
	"start_date": "2022-01-08",
	"end_date": "2025-06-24",
	"hourly": ["temperature_2m", "relative_humidity_2m", "dew_point_2m", "apparent_temperature", "rain", "pressure_msl", "surface_pressure", "et0_fao_evapotranspiration", "vapour_pressure_deficit", "wind_speed_10m", "wind_speed_100m", "soil_temperature_0_to_7cm", "soil_temperature_7_to_28cm", "soil_moisture_0_to_7cm", "soil_moisture_7_to_28cm", "boundary_layer_height", "wet_bulb_temperature_2m", "shortwave_radiation_instant", "direct_radiation_instant", "diffuse_radiation_instant", "direct_normal_irradiance_instant", "terrestrial_radiation_instant", "total_column_integrated_water_vapour", "albedo"],
	"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
hourly_dew_point_2m = hourly.Variables(2).ValuesAsNumpy()
hourly_apparent_temperature = hourly.Variables(3).ValuesAsNumpy()
hourly_rain = hourly.Variables(4).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(5).ValuesAsNumpy()
hourly_surface_pressure = hourly.Variables(6).ValuesAsNumpy()
hourly_et0_fao_evapotranspiration = hourly.Variables(7).ValuesAsNumpy()
hourly_vapour_pressure_deficit = hourly.Variables(8).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(9).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(10).ValuesAsNumpy()
hourly_soil_temperature_0_to_7cm = hourly.Variables(11).ValuesAsNumpy()
hourly_soil_temperature_7_to_28cm = hourly.Variables(12).ValuesAsNumpy()
hourly_soil_moisture_0_to_7cm = hourly.Variables(13).ValuesAsNumpy()
hourly_soil_moisture_7_to_28cm = hourly.Variables(14).ValuesAsNumpy()
hourly_boundary_layer_height = hourly.Variables(15).ValuesAsNumpy()
hourly_wet_bulb_temperature_2m = hourly.Variables(16).ValuesAsNumpy()
hourly_shortwave_radiation_instant = hourly.Variables(17).ValuesAsNumpy()
hourly_direct_radiation_instant = hourly.Variables(18).ValuesAsNumpy()
hourly_diffuse_radiation_instant = hourly.Variables(19).ValuesAsNumpy()
hourly_direct_normal_irradiance_instant = hourly.Variables(20).ValuesAsNumpy()
hourly_terrestrial_radiation_instant = hourly.Variables(21).ValuesAsNumpy()
hourly_total_column_integrated_water_vapour = hourly.Variables(22).ValuesAsNumpy()
hourly_cloud_cover_mid = hourly.Variables(24).ValuesAsNumpy()

hourly_data = {"date": pd.date_range(
	start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
	end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = hourly.Interval()),
	inclusive = "left"
)}

hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["dew_point_2m"] = hourly_dew_point_2m
hourly_data["apparent_temperature"] = hourly_apparent_temperature
hourly_data["rain"] = hourly_rain
hourly_data["pressure_msl"] = hourly_pressure_msl
hourly_data["surface_pressure"] = hourly_surface_pressure
hourly_data["et0_fao_evapotranspiration"] = hourly_et0_fao_evapotranspiration
hourly_data["vapour_pressure_deficit"] = hourly_vapour_pressure_deficit
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["soil_temperature_0_to_7cm"] = hourly_soil_temperature_0_to_7cm
hourly_data["soil_temperature_7_to_28cm"] = hourly_soil_temperature_7_to_28cm
hourly_data["soil_moisture_0_to_7cm"] = hourly_soil_moisture_0_to_7cm
hourly_data["soil_moisture_7_to_28cm"] = hourly_soil_moisture_7_to_28cm
hourly_data["boundary_layer_height"] = hourly_boundary_layer_height
hourly_data["wet_bulb_temperature_2m"] = hourly_wet_bulb_temperature_2m
hourly_data["shortwave_radiation_instant"] = hourly_shortwave_radiation_instant
hourly_data["direct_radiation_instant"] = hourly_direct_radiation_instant
hourly_data["diffuse_radiation_instant"] = hourly_diffuse_radiation_instant
hourly_data["direct_normal_irradiance_instant"] = hourly_direct_normal_irradiance_instant
hourly_data["terrestrial_radiation_instant"] = hourly_terrestrial_radiation_instant
hourly_data["total_column_integrated_water_vapour"] = hourly_total_column_integrated_water_vapour
hourly_data["cloud_cover_mid"] = hourly_cloud_cover_mid

hourly_dataframe = pd.DataFrame(data = hourly_data)
print(hourly_dataframe)
hourly_dataframe.info()
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
                           date  temperature_2m  relative_humidity_2m  \
0     2022-01-08 04:00:00+00:00       23.249001             71.679909   
1     2022-01-08 05:00:00+00:00       22.598999             76.695610   
2     2022-01-08 06:00:00+00:00       22.348999             76.176575   
3     2022-01-08 07:00:00+00:00       21.848999             79.526360   
4     2022-01-08 08:00:00+00:00       22.098999             80.060951   
...                         ...             ...                   ...   
30331 2025-06-24 23:00:00+00:00             NaN                   NaN   
30332 2025-06-25 00:00:00+00:00             NaN                   NaN   
30333 2025-06-25 01:00:00+00:00             NaN                   NaN   
30334 2025-06-25 02:00:00+00:00             NaN                   NaN   
30335 2025-06-25 03:00:00+00:00             NaN                   NaN   

       dew_point_2m  apparent_temperature  rain  pressure_msl  \
0         17.848999             21.988255   0.0   1018.500000   
1         18.299000             21.672226   0.0   1018.299988   
2         17.949001             20.790890   0.0   1017.599976   
3         18.148998             20.710756   0.1   1017.500000   
4         18.499001             20.978884   0.1   1017.400024   
...             ...                   ...   ...           ...   
30331           NaN                   NaN   NaN           NaN   
30332           NaN                   NaN   NaN           NaN   
30333           NaN                   NaN   NaN           NaN   
30334           NaN                   NaN   NaN           NaN   
30335           NaN                   NaN   NaN           NaN   

       surface_pressure  et0_fao_evapotranspiration  vapour_pressure_deficit  \
0            982.982544                    0.094050                 0.807554   
1            982.713318                    0.071562                 0.638901   
2            982.008240                    0.079427                 0.643319   
3            981.852722                    0.061376                 0.536290   
4            981.785583                    0.061776                 0.530283   
...                 ...                         ...                      ...   
30331               NaN                         NaN                      NaN   
30332               NaN                         NaN                      NaN   
30333               NaN                         NaN                      NaN   
30334               NaN                         NaN                      NaN   
30335               NaN                         NaN                      NaN   

       ...  soil_moisture_7_to_28cm  boundary_layer_height  \
0      ...                     0.07                  805.0   
1      ...                     0.07                  805.0   
2      ...                     0.07                  750.0   
3      ...                     0.07                  795.0   
4      ...                     0.07                  835.0   
...    ...                      ...                    ...   
30331  ...                      NaN                    NaN   
30332  ...                      NaN                    NaN   
30333  ...                      NaN                    NaN   
30334  ...                      NaN                    NaN   
30335  ...                      NaN                    NaN   

       wet_bulb_temperature_2m  shortwave_radiation_instant  \
0                    19.534161                          0.0   
1                    19.589806                          0.0   
2                    19.283928                          0.0   
3                    19.243179                          0.0   
4                    19.552332                          0.0   
...                        ...                          ...   
30331                      NaN                          NaN   
30332                      NaN                          NaN   
30333                      NaN                          NaN   
30334                      NaN                          NaN   
30335                      NaN                          NaN   

       direct_radiation_instant  diffuse_radiation_instant  \
0                           0.0                        0.0   
1                           0.0                        0.0   
2                           0.0                        0.0   
3                           0.0                        0.0   
4                           0.0                        0.0   
...                         ...                        ...   
30331                       NaN                        NaN   
30332                       NaN                        NaN   
30333                       NaN                        NaN   
30334                       NaN                        NaN   
30335                       NaN                        NaN   

       direct_normal_irradiance_instant  terrestrial_radiation_instant  \
0                                   0.0                            0.0   
1                                   0.0                            0.0   
2                                   0.0                            0.0   
3                                   0.0                            0.0   
4                                   0.0                            0.0   
...                                 ...                            ...   
30331                               NaN                            0.0   
30332                               NaN                            0.0   
30333                               NaN                            0.0   
30334                               NaN                            0.0   
30335                               NaN                            0.0   

       total_column_integrated_water_vapour  cloud_cover_mid  
0                                 33.200001                0  
1                                 33.400002                0  
2                                 33.500000                0  
3                                 33.700001                0  
4                                 33.299999                0  
...                                     ...              ...  
30331                                   NaN                0  
30332                                   NaN                0  
30333                                   NaN                0  
30334                                   NaN                0  
30335                                   NaN                0  

[30336 rows x 25 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30336 entries, 0 to 30335
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype              
---  ------                                --------------  -----              
 0   date                                  30336 non-null  datetime64[ns, UTC]
 1   temperature_2m                        30309 non-null  float32            
 2   relative_humidity_2m                  30309 non-null  float32            
 3   dew_point_2m                          30309 non-null  float32            
 4   apparent_temperature                  30309 non-null  float32            
 5   rain                                  30309 non-null  float32            
 6   pressure_msl                          30309 non-null  float32            
 7   surface_pressure                      30309 non-null  float32            
 8   et0_fao_evapotranspiration            30309 non-null  float32            
 9   vapour_pressure_deficit               30309 non-null  float32            
 10  wind_speed_10m                        30309 non-null  float32            
 11  wind_speed_100m                       30309 non-null  float32            
 12  soil_temperature_0_to_7cm             30309 non-null  float32            
 13  soil_temperature_7_to_28cm            30309 non-null  float32            
 14  soil_moisture_0_to_7cm                30309 non-null  float32            
 15  soil_moisture_7_to_28cm               30309 non-null  float32            
 16  boundary_layer_height                 25941 non-null  float32            
 17  wet_bulb_temperature_2m               30309 non-null  float32            
 18  shortwave_radiation_instant           30309 non-null  float32            
 19  direct_radiation_instant              30309 non-null  float32            
 20  diffuse_radiation_instant             30309 non-null  float32            
 21  direct_normal_irradiance_instant      30309 non-null  float32            
 22  terrestrial_radiation_instant         30336 non-null  float32            
 23  total_column_integrated_water_vapour  25941 non-null  float32            
 24  cloud_cover_mid                       30336 non-null  int64              
dtypes: datetime64[ns, UTC](1), float32(23), int64(1)
memory usage: 3.1 MB
In [119]:
hourly_dataframe_clean = hourly_dataframe.dropna()
hourly_dataframe_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25941 entries, 0 to 30308
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype              
---  ------                                --------------  -----              
 0   date                                  25941 non-null  datetime64[ns, UTC]
 1   temperature_2m                        25941 non-null  float32            
 2   relative_humidity_2m                  25941 non-null  float32            
 3   dew_point_2m                          25941 non-null  float32            
 4   apparent_temperature                  25941 non-null  float32            
 5   rain                                  25941 non-null  float32            
 6   pressure_msl                          25941 non-null  float32            
 7   surface_pressure                      25941 non-null  float32            
 8   et0_fao_evapotranspiration            25941 non-null  float32            
 9   vapour_pressure_deficit               25941 non-null  float32            
 10  wind_speed_10m                        25941 non-null  float32            
 11  wind_speed_100m                       25941 non-null  float32            
 12  soil_temperature_0_to_7cm             25941 non-null  float32            
 13  soil_temperature_7_to_28cm            25941 non-null  float32            
 14  soil_moisture_0_to_7cm                25941 non-null  float32            
 15  soil_moisture_7_to_28cm               25941 non-null  float32            
 16  boundary_layer_height                 25941 non-null  float32            
 17  wet_bulb_temperature_2m               25941 non-null  float32            
 18  shortwave_radiation_instant           25941 non-null  float32            
 19  direct_radiation_instant              25941 non-null  float32            
 20  diffuse_radiation_instant             25941 non-null  float32            
 21  direct_normal_irradiance_instant      25941 non-null  float32            
 22  terrestrial_radiation_instant         25941 non-null  float32            
 23  total_column_integrated_water_vapour  25941 non-null  float32            
 24  cloud_cover_mid                       25941 non-null  int64              
dtypes: datetime64[ns, UTC](1), float32(23), int64(1)
memory usage: 2.9 MB
In [120]:
hourly_dataframe_clean.isna().sum()
Out[120]:
date                                    0
temperature_2m                          0
relative_humidity_2m                    0
dew_point_2m                            0
apparent_temperature                    0
rain                                    0
pressure_msl                            0
surface_pressure                        0
et0_fao_evapotranspiration              0
vapour_pressure_deficit                 0
wind_speed_10m                          0
wind_speed_100m                         0
soil_temperature_0_to_7cm               0
soil_temperature_7_to_28cm              0
soil_moisture_0_to_7cm                  0
soil_moisture_7_to_28cm                 0
boundary_layer_height                   0
wet_bulb_temperature_2m                 0
shortwave_radiation_instant             0
direct_radiation_instant                0
diffuse_radiation_instant               0
direct_normal_irradiance_instant        0
terrestrial_radiation_instant           0
total_column_integrated_water_vapour    0
cloud_cover_mid                         0
dtype: int64

Summary Statistics for Hourly data¶

In [122]:
# Drop the first column and determinie summary statistics.
hourly_data_sans_first_col = hourly_dataframe_clean.iloc[:, 1:]
hourly_summary_stats = hourly_data_sans_first_col.describe()
print(hourly_summary_stats)
       temperature_2m  relative_humidity_2m  dew_point_2m  \
count    25941.000000          25941.000000  25941.000000   
mean        25.154356             74.224205     20.160091   
std          1.548129              7.545050      2.046124   
min         19.398998             35.666039     10.199000   
25%         24.098999             69.562462     18.648998   
50%         25.199001             74.784309     20.398998   
75%         26.348999             79.880203     21.799000   
max         29.598999             94.986580     24.199001   

       apparent_temperature          rain  pressure_msl  surface_pressure  \
count          25941.000000  25941.000000  25941.000000      25941.000000   
mean              25.536901      0.098462   1014.861694        979.691528   
std                2.717877      0.474637      2.277580          2.160366   
min               17.866764      0.000000   1003.400024        968.797668   
25%               23.569269      0.000000   1013.500000        978.373596   
50%               25.550934      0.000000   1015.000000        979.846619   
75%               27.414621      0.100000   1016.400024        981.179199   
max               34.207329     18.799999   1021.900024        986.293640   

       et0_fao_evapotranspiration  vapour_pressure_deficit  wind_speed_10m  \
count                25941.000000             25941.000000    25941.000000   
mean                     0.207622                 0.830922       26.346403   
std                      0.173408                 0.270467        8.088293   
min                      0.000000                 0.144631        0.360000   
25%                      0.074283                 0.632275       21.612743   
50%                      0.118418                 0.807082       26.693459   
75%                      0.346611                 0.981912       31.698402   
max                      0.731624                 2.260390       64.005127   

       ...  soil_moisture_7_to_28cm  boundary_layer_height  \
count  ...             25941.000000           25941.000000   
mean   ...                 0.046259             776.942505   
std    ...                 0.053182             201.177765   
min    ...                 0.000000             115.000000   
25%    ...                 0.000000             645.000000   
50%    ...                 0.000000             770.000000   
75%    ...                 0.082000             900.000000   
max    ...                 0.353000            1880.000000   

       wet_bulb_temperature_2m  shortwave_radiation_instant  \
count             25941.000000                 25941.000000   
mean                 21.658060                   245.739502   
std                   1.639583                   320.305603   
min                  16.066504                     0.000000   
25%                  20.349979                     0.000000   
50%                  21.830799                     0.000000   
75%                  23.056477                   516.396912   
max                  24.967825                  1028.832153   

       direct_radiation_instant  diffuse_radiation_instant  \
count              25941.000000               25941.000000   
mean                 184.713196                  61.026295   
std                  259.038788                  79.678406   
min                    0.000000                   0.000000   
25%                    0.000000                   0.000000   
50%                    0.000000                   0.000000   
75%                  370.946777                 113.992241   
max                  925.033020                 449.724121   

       direct_normal_irradiance_instant  terrestrial_radiation_instant  \
count                      25941.000000                   25941.000000   
mean                         272.184357                     401.996582   
std                          333.275879                     488.299744   
min                            0.000000                       0.000000   
25%                            0.000000                       0.000000   
50%                            0.000000                      11.177355   
75%                          613.130493                     910.017212   
max                         1010.743530                    1349.767578   

       total_column_integrated_water_vapour  cloud_cover_mid  
count                          25941.000000          25941.0  
mean                              39.281063              0.0  
std                                9.275361              0.0  
min                               16.200001              0.0  
25%                               32.200001              0.0  
50%                               38.700001              0.0  
75%                               46.000000              0.0  
max                               72.699997              0.0  

[8 rows x 24 columns]

Skew and Kurtosis¶

In [124]:
import scipy.stats as stats
#Skew and kurtosis
skewness_hourly = hourly_data_sans_first_col.skew()
kurtosis_hourly = hourly_data_sans_first_col.kurtosis()
print("Skewness:")
print(skewness_hourly)
print("\nKurtosis:")
print(kurtosis_hourly)
Skewness:
temperature_2m                          -0.196995
relative_humidity_2m                    -0.509483
dew_point_2m                            -0.428209
apparent_temperature                     0.071160
rain                                    15.350304
pressure_msl                            -0.420989
surface_pressure                        -0.409151
et0_fao_evapotranspiration               0.892362
vapour_pressure_deficit                  0.702015
wind_speed_10m                          -0.264995
wind_speed_100m                         -0.340911
soil_temperature_0_to_7cm                1.467343
soil_temperature_7_to_28cm               0.589537
soil_moisture_0_to_7cm                   2.719525
soil_moisture_7_to_28cm                  0.925203
boundary_layer_height                    0.420913
wet_bulb_temperature_2m                 -0.279724
shortwave_radiation_instant              0.892665
direct_radiation_instant                 1.098373
diffuse_radiation_instant                1.302339
direct_normal_irradiance_instant         0.658273
terrestrial_radiation_instant            0.686273
total_column_integrated_water_vapour     0.233287
cloud_cover_mid                          0.000000
dtype: float64

Kurtosis:
temperature_2m                           -0.447634
relative_humidity_2m                      0.118305
dew_point_2m                             -0.424773
apparent_temperature                     -0.449123
rain                                    344.158569
pressure_msl                              0.306904
surface_pressure                          0.336213
et0_fao_evapotranspiration               -0.599063
vapour_pressure_deficit                   0.576414
wind_speed_10m                            0.280692
wind_speed_100m                           0.293887
soil_temperature_0_to_7cm                 2.663844
soil_temperature_7_to_28cm                0.020899
soil_moisture_0_to_7cm                   10.640102
soil_moisture_7_to_28cm                   0.636145
boundary_layer_height                     0.949319
wet_bulb_temperature_2m                  -0.895999
shortwave_radiation_instant              -0.774997
direct_radiation_instant                 -0.281340
diffuse_radiation_instant                 1.431475
direct_normal_irradiance_instant         -1.255075
terrestrial_radiation_instant            -1.190991
total_column_integrated_water_vapour     -0.460983
cloud_cover_mid                           0.000000
dtype: float64

Histograms For Hourly Data¶

NOTE: Quantile-Quantile plots like for daily data will not be entertained for hourly data, because hourly data is very large in volume. For example, three years will have more instances than 4 decades.

In [126]:
import matplotlib.pyplot as plt
import seaborn as sns

# Get the column names
column_names = hourly_data_sans_first_col.columns
print(column_names)
column_names_list = column_names.tolist()

# Calculating the number of ros and columns for subplots.
num_cols = 3  # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
     # Calculating the number of rows

# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))

# Flatten if required.
if num_rows > 1:
  axes = axes.flatten()

# Plot the histograms
for i, col in enumerate(column_names_list):
  sns.histplot(data = hourly_data_sans_first_col[col], ax = axes[i], kde = True)
  axes[i].set_title(f'Histogram of {col}')
  axes[i].set_xlabel('Value')
  axes[i].set_ylabel('Frequency')
  axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['temperature_2m', 'relative_humidity_2m', 'dew_point_2m',
       'apparent_temperature', 'rain', 'pressure_msl', 'surface_pressure',
       'et0_fao_evapotranspiration', 'vapour_pressure_deficit',
       'wind_speed_10m', 'wind_speed_100m', 'soil_temperature_0_to_7cm',
       'soil_temperature_7_to_28cm', 'soil_moisture_0_to_7cm',
       'soil_moisture_7_to_28cm', 'boundary_layer_height',
       'wet_bulb_temperature_2m', 'shortwave_radiation_instant',
       'direct_radiation_instant', 'diffuse_radiation_instant',
       'direct_normal_irradiance_instant', 'terrestrial_radiation_instant',
       'total_column_integrated_water_vapour', 'cloud_cover_mid'],
      dtype='object')
No description has been provided for this image

Correlation Analysis for Hourly Data¶

In [128]:
# Applying pearson correlation to the data set.
import matplotlib.pyplot as plt
import seaborn as sns
pearson_corr_hourly = hourly_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (20, 16))
sns.heatmap(pearson_corr_hourly, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation Heatmap for Hourly Data')
plt.savefig('heatmap.pdf', format='pdf')
plt.show()
No description has been provided for this image

For Pearson correlation applied prior, the above Pearson correlation heatmap conveys much about association among the variables, and level of linearity. Again, naturally, data pairs don't need to have a linear relationship in general.

For semi-diurnal (12 hour case) will choose the 17th of September 2024, being the day when the moon is most active, from 4 PM to 4 AM September 18th, 2024. Atmospheric tides show periodic behavior. Use Fourier analysis to detect these periodicities, say, a general 12 hour periodic setting:

Short-Term Forecasting With Hourly Data¶

Similar to what was done with the Prophet algorithm for long-term forecasting with daily data, will now be done with hourly data for short-term forecasting.

In [132]:
from prophet import Prophet 

# Create a copy of the original DataFrame
df_copy = hourly_dataframe_clean.copy()

# Check the column names (Debug line to verify the data structure)
print(df_copy.columns)

# Define the columns you are interested in
target_columns = ['apparent_temperature', 'temperature_2m',
'relative_humidity_2m', 'dew_point_2m', 'pressure_msl',
                  'boundary_layer_height', 'et0_fao_evapotranspiration',
                  'wet_bulb_temperature_2m', 'vapour_pressure_deficit']

# Proceed with forecasting for each selected column in the copied DataFrame
for col in target_columns:  # Limit iteration to the chosen columns
    # Create a temporary DataFrame with "date" as 'ds' and the target column as 'y'
    df_temp = df_copy[['date', col]].rename(columns={'date': 'ds', col: 'y'})

    # Remove timezone information from the 'ds' column
    df_temp['ds'] = df_temp['ds'].dt.tz_localize(None)  # Make datetime naive

    # Initialize and fit Prophet model
    model = Prophet()
    model.fit(df_temp)

    # Create future dataframe for forecasting (next 8 hours, for example)
    future = model.make_future_dataframe(periods=8, freq='h')

    # Generate forecast
    forecast = model.predict(future)

    # Output the forecast for this column
    print(f"Forecast for {col}:")
    print(forecast[['ds', 'yhat']].tail(8))  # Only showing the forecast for the next 8 periods
Index(['date', 'temperature_2m', 'relative_humidity_2m', 'dew_point_2m',
       'apparent_temperature', 'rain', 'pressure_msl', 'surface_pressure',
       'et0_fao_evapotranspiration', 'vapour_pressure_deficit',
       'wind_speed_10m', 'wind_speed_100m', 'soil_temperature_0_to_7cm',
       'soil_temperature_7_to_28cm', 'soil_moisture_0_to_7cm',
       'soil_moisture_7_to_28cm', 'boundary_layer_height',
       'wet_bulb_temperature_2m', 'shortwave_radiation_instant',
       'direct_radiation_instant', 'diffuse_radiation_instant',
       'direct_normal_irradiance_instant', 'terrestrial_radiation_instant',
       'total_column_integrated_water_vapour', 'cloud_cover_mid'],
      dtype='object')
23:09:22 - cmdstanpy - INFO - Chain [1] start processing
23:09:51 - cmdstanpy - INFO - Chain [1] done processing
Forecast for apparent_temperature:
                       ds       yhat
25941 2025-06-24 01:00:00  23.520416
25942 2025-06-24 02:00:00  23.381653
25943 2025-06-24 03:00:00  23.230667
25944 2025-06-24 04:00:00  23.109773
25945 2025-06-24 05:00:00  23.010959
25946 2025-06-24 06:00:00  22.903244
25947 2025-06-24 07:00:00  22.782459
25948 2025-06-24 08:00:00  22.687054
23:09:57 - cmdstanpy - INFO - Chain [1] start processing
23:10:19 - cmdstanpy - INFO - Chain [1] done processing
Forecast for temperature_2m:
                       ds       yhat
25941 2025-06-24 01:00:00  25.058977
25942 2025-06-24 02:00:00  24.973953
25943 2025-06-24 03:00:00  24.857244
25944 2025-06-24 04:00:00  24.736486
25945 2025-06-24 05:00:00  24.636620
25946 2025-06-24 06:00:00  24.551876
25947 2025-06-24 07:00:00  24.460657
25948 2025-06-24 08:00:00  24.367929
23:10:27 - cmdstanpy - INFO - Chain [1] start processing
23:10:44 - cmdstanpy - INFO - Chain [1] done processing
Forecast for relative_humidity_2m:
                       ds       yhat
25941 2025-06-24 01:00:00  80.488718
25942 2025-06-24 02:00:00  80.966981
25943 2025-06-24 03:00:00  81.477646
25944 2025-06-24 04:00:00  81.913806
25945 2025-06-24 05:00:00  82.209654
25946 2025-06-24 06:00:00  82.436414
25947 2025-06-24 07:00:00  82.712208
25948 2025-06-24 08:00:00  83.010652
23:10:50 - cmdstanpy - INFO - Chain [1] start processing
23:11:06 - cmdstanpy - INFO - Chain [1] done processing
Forecast for dew_point_2m:
                       ds       yhat
25941 2025-06-24 01:00:00  21.372553
25942 2025-06-24 02:00:00  21.390714
25943 2025-06-24 03:00:00  21.390129
25944 2025-06-24 04:00:00  21.371335
25945 2025-06-24 05:00:00  21.339052
25946 2025-06-24 06:00:00  21.301854
25947 2025-06-24 07:00:00  21.268606
25948 2025-06-24 08:00:00  21.243605
23:11:13 - cmdstanpy - INFO - Chain [1] start processing
23:11:44 - cmdstanpy - INFO - Chain [1] done processing
Forecast for pressure_msl:
                       ds         yhat
25941 2025-06-24 01:00:00  1019.488802
25942 2025-06-24 02:00:00  1019.823980
25943 2025-06-24 03:00:00  1019.807950
25944 2025-06-24 04:00:00  1019.455391
25945 2025-06-24 05:00:00  1018.911814
25946 2025-06-24 06:00:00  1018.377475
25947 2025-06-24 07:00:00  1018.016779
25948 2025-06-24 08:00:00  1017.911495
23:11:50 - cmdstanpy - INFO - Chain [1] start processing
23:12:11 - cmdstanpy - INFO - Chain [1] done processing
Forecast for boundary_layer_height:
                       ds        yhat
25941 2025-06-24 01:00:00  926.688918
25942 2025-06-24 02:00:00  926.878159
25943 2025-06-24 03:00:00  926.378244
25944 2025-06-24 04:00:00  923.726626
25945 2025-06-24 05:00:00  917.254922
25946 2025-06-24 06:00:00  907.935746
25947 2025-06-24 07:00:00  899.915152
25948 2025-06-24 08:00:00  897.750750
23:12:18 - cmdstanpy - INFO - Chain [1] start processing
23:12:25 - cmdstanpy - INFO - Chain [1] done processing
Forecast for et0_fao_evapotranspiration:
                       ds      yhat
25941 2025-06-24 01:00:00  0.073359
25942 2025-06-24 02:00:00  0.076604
25943 2025-06-24 03:00:00  0.073166
25944 2025-06-24 04:00:00  0.066872
25945 2025-06-24 05:00:00  0.064847
25946 2025-06-24 06:00:00  0.067964
25947 2025-06-24 07:00:00  0.069845
25948 2025-06-24 08:00:00  0.065331
23:12:33 - cmdstanpy - INFO - Chain [1] start processing
23:12:52 - cmdstanpy - INFO - Chain [1] done processing
Forecast for wet_bulb_temperature_2m:
                       ds       yhat
25941 2025-06-24 01:00:00  22.342584
25942 2025-06-24 02:00:00  22.327333
25943 2025-06-24 03:00:00  22.287810
25944 2025-06-24 04:00:00  22.234396
25945 2025-06-24 05:00:00  22.180021
25946 2025-06-24 06:00:00  22.128784
25947 2025-06-24 07:00:00  22.078052
25948 2025-06-24 08:00:00  22.030733
23:12:58 - cmdstanpy - INFO - Chain [1] start processing
23:13:08 - cmdstanpy - INFO - Chain [1] done processing
Forecast for vapour_pressure_deficit:
                       ds      yhat
25941 2025-06-24 01:00:00  0.605932
25942 2025-06-24 02:00:00  0.587156
25943 2025-06-24 03:00:00  0.565609
25944 2025-06-24 04:00:00  0.546271
25945 2025-06-24 05:00:00  0.532887
25946 2025-06-24 06:00:00  0.523015
25947 2025-06-24 07:00:00  0.511349
25948 2025-06-24 08:00:00  0.498137

Apparent Temperature: The Temperature You Feel, Not Just the Temperature on the Thermometer¶

Apparent temperature, often referred to as the "feels like" temperature, is a measure of how hot or cold it feels outside, taking into account factors beyond just the air temperature. These factors include humidity and wind speed, which significantly impact our body's ability to regulate temperature.

When the air is humid, sweat, our body's natural cooling mechanism, evaporates less efficiently. This makes it harder for our bodies to cool down, leading to a higher perceived temperature. Conversely, when the air is dry, sweat evaporates more readily, making us feel cooler.

Wind chill, on the other hand, is the effect of wind on the perceived temperature when it's cold. As wind speeds increase, it accelerates heat loss from our bodies, making us feel colder than the actual air temperature.

Feature Selection for Apparent Temperature with Hourly Meteorological Data¶

Now, there's interest in developing predictive models involving possible relationships between atmospheric physical/chemical attributes instead of predictions based on chronologically sequenced of the atmospheric physical/chemical attributes. In similar fashion to feature selection for daily meteorological data, feature selection will also be done for hourly meteorological data.

For the hourly data set will focus on 'apparent_temperature' due to:

  1. Computational expense and time dealing with hourly data over multiple years.
  2. The 'apparent_temperature' target will be applied for later on development.
In [134]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Assuming your DataFrame is named 'hourly_data_sans_first_col'
# Define features and target variable
X = hourly_data_sans_first_col.drop(columns=['apparent_temperature'])  # Features
y = hourly_data_sans_first_col['apparent_temperature']                   # Target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)

# Fit the model
rf_model.fit(X_train, y_train)

# Get feature importances
importances = rf_model.feature_importances_

# Create a DataFrame for feature importances
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Feature Importances from Random Forest')
plt.gca().invert_yaxis()  # Invert y-axis to have the most important feature on top
plt.show()

# Print ranked features based on importance
print("Ranked Features based on Importance:")
print(feature_importances)

# Recursive Feature Elimination (RFE)
rfe = RFE(estimator=rf_model, n_features_to_select=5)  # Select top 5 features
rfe.fit(X_train, y_train)

# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected Features by RFE:")
print(selected_features)
No description has been provided for this image
Ranked Features based on Importance:
                                 Feature  Importance
0                         temperature_2m    0.593167
9                        wind_speed_100m    0.165724
15               wet_bulb_temperature_2m    0.162016
8                         wind_speed_10m    0.031364
6             et0_fao_evapotranspiration    0.019404
14                 boundary_layer_height    0.011443
16           shortwave_radiation_instant    0.006382
17              direct_radiation_instant    0.003906
2                           dew_point_2m    0.001083
20         terrestrial_radiation_instant    0.000916
10             soil_temperature_0_to_7cm    0.000650
11            soil_temperature_7_to_28cm    0.000600
21  total_column_integrated_water_vapour    0.000589
1                   relative_humidity_2m    0.000575
7                vapour_pressure_deficit    0.000501
19      direct_normal_irradiance_instant    0.000311
5                       surface_pressure    0.000300
4                           pressure_msl    0.000265
18             diffuse_radiation_instant    0.000255
13               soil_moisture_7_to_28cm    0.000237
12                soil_moisture_0_to_7cm    0.000210
3                                   rain    0.000100
22                       cloud_cover_mid    0.000000
Selected Features by RFE:
Index(['temperature_2m', 'et0_fao_evapotranspiration', 'wind_speed_10m',
       'wind_speed_100m', 'wet_bulb_temperature_2m'],
      dtype='object')
In [135]:
# Applying pearson correlation to the data set.

appar_hourly = hourly_data_sans_first_col[['temperature_2m', 'wind_speed_100m', 'wet_bulb_temperature_2m', 
                                                      'wind_speed_10m', 'et0_fao_evapotranspiration',
                                           'boundary_layer_height',  'direct_radiation_instant']]
appar_corr = appar_hourly.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(appar_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation Heatmap for Apparent Temperature')
plt.savefig('heatmap.pdf', format='pdf')
plt.show()
No description has been provided for this image

Based on the importance/rank of the features, along with the correlation heatmap in consideration, some features will be dropped to possibly treat multicollinearity issues.

To now examine the quality of the resulting quantile regression model.

In [138]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Assuming your DataFrame is named 'hourly_data_sans_first_col'
# Define features and target variable
X = hourly_data_sans_first_col[['temperature_2m', 'wind_speed_100m',
                     'wet_bulb_temperature_2m',
                     'et0_fao_evapotranspiration', 'boundary_layer_height']]
y = hourly_data_sans_first_col[['apparent_temperature']]

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the quantile regression model for the 0.5 quantile (median)
quantiles = [0.25, 0.5, 0.75]  # Define quantiles of interest
models = {}

for q in quantiles:
    model = sm.QuantReg(y, X)
    results = model.fit(q=q)
    models[q] = results
    print(f"Quantile Regression Results for q={q}:")
    print(results.summary())
    print("\n")
Quantile Regression Results for q=0.25:
                          QuantReg Regression Results                           
================================================================================
Dep. Variable:     apparent_temperature   Pseudo R-squared:               0.8877
Model:                         QuantReg   Bandwidth:                     0.05118
Method:                   Least Squares   Sparsity:                       0.6802
Date:                  Fri, 27 Jun 2025   No. Observations:                25941
Time:                          23:19:06   Df Residuals:                    25935
                                          Df Model:                            5
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -6.1065      0.032   -192.606      0.000      -6.169      -6.044
temperature_2m                 0.7347      0.003    293.507      0.000       0.730       0.740
wind_speed_100m               -0.1276      0.000   -490.656      0.000      -0.128      -0.127
wet_bulb_temperature_2m        0.7634      0.002    315.794      0.000       0.759       0.768
et0_fao_evapotranspiration     1.3379      0.011    120.982      0.000       1.316       1.360
boundary_layer_height         -0.0001   1.49e-05     -7.564      0.000      -0.000   -8.34e-05
==============================================================================================

The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


Quantile Regression Results for q=0.5:
                          QuantReg Regression Results                           
================================================================================
Dep. Variable:     apparent_temperature   Pseudo R-squared:               0.8800
Model:                         QuantReg   Bandwidth:                     0.05505
Method:                   Least Squares   Sparsity:                       0.8144
Date:                  Fri, 27 Jun 2025   No. Observations:                25941
Time:                          23:19:07   Df Residuals:                    25935
                                          Df Model:                            5
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -4.6365      0.047    -97.663      0.000      -4.730      -4.543
temperature_2m                 0.6652      0.004    181.683      0.000       0.658       0.672
wind_speed_100m               -0.1294      0.000   -356.885      0.000      -0.130      -0.129
wet_bulb_temperature_2m        0.7780      0.004    219.616      0.000       0.771       0.785
et0_fao_evapotranspiration     2.2097      0.018    120.089      0.000       2.174       2.246
boundary_layer_height       3.946e-05   2.05e-05      1.926      0.054   -6.97e-07    7.96e-05
==============================================================================================

The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


Quantile Regression Results for q=0.75:
                          QuantReg Regression Results                           
================================================================================
Dep. Variable:     apparent_temperature   Pseudo R-squared:               0.8850
Model:                         QuantReg   Bandwidth:                     0.05190
Method:                   Least Squares   Sparsity:                       0.7975
Date:                  Fri, 27 Jun 2025   No. Observations:                25941
Time:                          23:19:08   Df Residuals:                    25935
                                          Df Model:                            5
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -4.6079      0.038   -122.370      0.000      -4.682      -4.534
temperature_2m                 0.6597      0.003    207.722      0.000       0.654       0.666
wind_speed_100m               -0.1265      0.000   -413.064      0.000      -0.127      -0.126
wet_bulb_temperature_2m        0.7859      0.003    251.546      0.000       0.780       0.792
et0_fao_evapotranspiration     2.8413      0.018    157.897      0.000       2.806       2.877
boundary_layer_height      -5.767e-06   1.68e-05     -0.343      0.731   -3.87e-05    2.72e-05
==============================================================================================

The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


As for any remaining multicollinearity issue(s), the reasonable "pruning" would be dropping the 'wet_bulb_temperature_2m' feature.

In [140]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Assuming your DataFrame is named 'hourly_data_sans_first_col'
# Define features and target variable
X = hourly_data_sans_first_col[['temperature_2m', 'wind_speed_100m',
                     'et0_fao_evapotranspiration', 'boundary_layer_height']]
y = hourly_data_sans_first_col[['apparent_temperature']]

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the quantile regression model for the 0.5 quantile (median)
quantiles = [0.25, 0.5, 0.75]  # Define quantiles of interest
models = {}

for q in quantiles:
    model = sm.QuantReg(y, X)
    results = model.fit(q=q)
    models[q] = results
    print(f"Quantile Regression Results for q={q}:")
    print(results.summary())
    print("\n")
Quantile Regression Results for q=0.25:
                          QuantReg Regression Results                           
================================================================================
Dep. Variable:     apparent_temperature   Pseudo R-squared:               0.7829
Model:                         QuantReg   Bandwidth:                     0.08148
Method:                   Least Squares   Sparsity:                        1.595
Date:                  Fri, 27 Jun 2025   No. Observations:                25941
Time:                          23:19:08   Df Residuals:                    25936
                                          Df Model:                            4
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -4.9731      0.076    -65.062      0.000      -5.123      -4.823
temperature_2m                 1.4022      0.003    473.345      0.000       1.396       1.408
wind_speed_100m               -0.0885      0.001   -166.070      0.000      -0.090      -0.087
et0_fao_evapotranspiration    -0.6781      0.024    -27.887      0.000      -0.726      -0.630
boundary_layer_height         -0.0031   2.47e-05   -124.506      0.000      -0.003      -0.003
==============================================================================================

The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


Quantile Regression Results for q=0.5:
                          QuantReg Regression Results                           
================================================================================
Dep. Variable:     apparent_temperature   Pseudo R-squared:               0.7756
Model:                         QuantReg   Bandwidth:                     0.09265
Method:                   Least Squares   Sparsity:                        1.374
Date:                  Fri, 27 Jun 2025   No. Observations:                25941
Time:                          23:19:08   Df Residuals:                    25936
                                          Df Model:                            4
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -4.4653      0.080    -55.978      0.000      -4.622      -4.309
temperature_2m                 1.3925      0.003    450.059      0.000       1.386       1.399
wind_speed_100m               -0.0925      0.001   -168.418      0.000      -0.094      -0.091
et0_fao_evapotranspiration     0.2445      0.028      8.848      0.000       0.190       0.299
boundary_layer_height         -0.0030   2.62e-05   -114.357      0.000      -0.003      -0.003
==============================================================================================

The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


Quantile Regression Results for q=0.75:
                          QuantReg Regression Results                           
================================================================================
Dep. Variable:     apparent_temperature   Pseudo R-squared:               0.7652
Model:                         QuantReg   Bandwidth:                     0.08407
Method:                   Least Squares   Sparsity:                        1.985
Date:                  Fri, 27 Jun 2025   No. Observations:                25941
Time:                          23:19:09   Df Residuals:                    25936
                                          Df Model:                            4
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -4.0197      0.102    -39.508      0.000      -4.219      -3.820
temperature_2m                 1.3826      0.004    352.417      0.000       1.375       1.390
wind_speed_100m               -0.0961      0.001   -133.772      0.000      -0.097      -0.095
et0_fao_evapotranspiration     1.2045      0.039     31.252      0.000       1.129       1.280
boundary_layer_height         -0.0028   3.38e-05    -82.909      0.000      -0.003      -0.003
==============================================================================================

The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.


Observed is a drastic drop in Psuedo R-squared, hence, the 'wet_bulb_temperature_2m' feature should be retained. Yet, observing the remiaing "high" value correlation pairs, the multicollinearity concern can be ignored.

NOTE: the above summary statistics only concerns the applied data set. Say, time orientation and place of interest.

The Heat Index: A Measure of Perceived Temperature¶

The heat index is a measure of how hot it feels outside when the relative humidity is factored in with the air temperature. It combines the effects of both temperature and humidity to provide a more accurate representation of the perceived temperature, especially in hot and humid environments.

When the air temperature rises and the relative humidity increases, the human body's ability to cool itself through perspiration becomes less efficient. This is because the moisture in the air slows down the evaporation process, which is essential for cooling. As a result, the body feels hotter than the actual air temperature.

The heat index is calculated using a mathematical formula that takes into account both temperature and relative humidity. It is expressed in degrees Fahrenheit or Celsius, and it provides a more accurate indication of how hot it feels to a person. A high heat index can pose health risks, especially for vulnerable populations such as the elderly, young children, and those with certain medical conditions.

Understanding the heat index is important for individuals and communities to take appropriate precautions during hot weather. By being aware of the heat index, people can stay hydrated, avoid strenuous activities during peak heat hours, and take steps to protect themselves from heat-related illnesses.

From Anderson et al 2013, the (American) National Weather Service (NWS) uses its own complex algorithm for forecasts and heat warnings, and has created a website that calculates heat index using this algorithm, although only for one heat index value at a time (NWS 2011). It's algorithm:

https://ehp.niehs.nih.gov/cms/10.1289/ehp.1206273/asset/39789e63-2aba-4625-9782-bd10ec26fcf1/assets/graphic/ehp.1206273.g003.jpg

In [144]:
import math

# Function to calculate the Heat Index (HI)
def calculate_heat_index(T, H):
    # Step 1: Check if temperature is less than or equal to 40°F
    if T <= 40:
        return T

    # Step 2: Calculate A
    A = -10.3 + 1.1 * T + 0.047 * H

    # Step 3: Check if A is less than 79°F
    if A < 79:
        return A

    # Step 4: Calculate B using the full formula
    B = (-42.379 + 2.04901523 * T + 10.14333127 * H
         - 0.22475541 * T * H - 6.83783 * 10**(-3) * T**2
         - 5.481717 * 10**(-2) * H**2 + 1.22874 * 10**(-3) * T**2 * H
         + 8.5282 * 10**(-4) * T * H**2 - 1.99 * 10**(-6) * T**2 * H**2)

    # Step 5: Check specific conditions for further adjustments
    if H <= 13 and 80 <= T <= 112:
        B -= ((13 - H) / 4) * math.sqrt((17 - abs(T - 95)) / 17)
        return B

    if H > 85 and 80 <= T <= 87:
        B += 0.02 * (H - 85) * (87 - T)
        return B

    # Step 6: Default case
    return B

# Example usage
T = 90  # Example temperature in °F
H = 70  # Example relative humidity in %
heat_index = calculate_heat_index(T, H)
print(f"The Heat Index is: {round(heat_index, 2)} °F")
The Heat Index is: 105.92 °F

Now to covert the above algorithm to the celsius measure:

In [146]:
# Function to calculate the Heat Index (HI) with temperature in Celsius
def calculate_heat_index_cel(T_celsius, H):
    # Convert temperature from Celsius to Fahrenheit
    T = (T_celsius * 9/5) + 32
    
    # Step 1: Check if temperature is less than or equal to 40°F
    if T <= 40:
        return T_celsius

    # Step 2: Calculate A
    A = -10.3 + 1.1 * T + 0.047 * H

    # Step 3: Check if A is less than 79°F
    if A < 79:
        return (A - 32) * 5/9  # Convert back to Celsius

    # Step 4: Calculate B using the full formula
    B = (-42.379 + 2.04901523 * T + 10.14333127 * H
         - 0.22475541 * T * H - 6.83783 * 10**(-3) * T**2
         - 5.481717 * 10**(-2) * H**2 + 1.22874 * 10**(-3) * T**2 * H
         + 8.5282 * 10**(-4) * T * H**2 - 1.99 * 10**(-6) * T**2 * H**2)

    # Step 5: Check specific conditions for further adjustments
    if H <= 13 and 80 <= T <= 112:
        B -= ((13 - H) / 4) * math.sqrt((17 - abs(T - 95)) / 17)
        return (B - 32) * 5/9  # Convert back to Celsius

    if H > 85 and 80 <= T <= 87:
        B += 0.02 * (H - 85) * (87 - T)
        return (B - 32) * 5/9  # Convert back to Celsius

    # Step 6: Default case
    return (B - 32) * 5/9  # Convert back to Celsius

# Example usage
T_celsius = 32.2  # Example temperature in °C (equivalent to 90°F)
H = 70  # Example relative humidity in %
heat_index_celsius = calculate_heat_index_cel(T_celsius, H)
print(f"The Heat Index is: {round(heat_index_celsius, 2)} °C")
The Heat Index is: 41.0 °C

To now observe visually how the (American) NWS Heat Index compares to Apparent Temperature:

In [148]:
hourly_dataframe_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25941 entries, 0 to 30308
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype              
---  ------                                --------------  -----              
 0   date                                  25941 non-null  datetime64[ns, UTC]
 1   temperature_2m                        25941 non-null  float32            
 2   relative_humidity_2m                  25941 non-null  float32            
 3   dew_point_2m                          25941 non-null  float32            
 4   apparent_temperature                  25941 non-null  float32            
 5   rain                                  25941 non-null  float32            
 6   pressure_msl                          25941 non-null  float32            
 7   surface_pressure                      25941 non-null  float32            
 8   et0_fao_evapotranspiration            25941 non-null  float32            
 9   vapour_pressure_deficit               25941 non-null  float32            
 10  wind_speed_10m                        25941 non-null  float32            
 11  wind_speed_100m                       25941 non-null  float32            
 12  soil_temperature_0_to_7cm             25941 non-null  float32            
 13  soil_temperature_7_to_28cm            25941 non-null  float32            
 14  soil_moisture_0_to_7cm                25941 non-null  float32            
 15  soil_moisture_7_to_28cm               25941 non-null  float32            
 16  boundary_layer_height                 25941 non-null  float32            
 17  wet_bulb_temperature_2m               25941 non-null  float32            
 18  shortwave_radiation_instant           25941 non-null  float32            
 19  direct_radiation_instant              25941 non-null  float32            
 20  diffuse_radiation_instant             25941 non-null  float32            
 21  direct_normal_irradiance_instant      25941 non-null  float32            
 22  terrestrial_radiation_instant         25941 non-null  float32            
 23  total_column_integrated_water_vapour  25941 non-null  float32            
 24  cloud_cover_mid                       25941 non-null  int64              
dtypes: datetime64[ns, UTC](1), float32(23), int64(1)
memory usage: 2.9 MB
In [149]:
# If you are working with a filtered DataFrame, make a copy to avoid SettingWithCopyWarning
hi_hourly_meteo_data = hourly_dataframe_clean.copy()

# Apply the calculate_heat_index function to each row using .loc
hi_hourly_meteo_data.loc[:, 'calculated_heat_index_cel'] = hi_hourly_meteo_data.apply(
    lambda row: calculate_heat_index_cel(row['temperature_2m'], row['relative_humidity_2m']),
    axis=1
)

# Plot comparison between calculated heat index and apparent temperature
plt.figure(figsize=(10, 6))
plt.plot(hi_hourly_meteo_data['date'], hi_hourly_meteo_data['calculated_heat_index_cel'],
         label='Calculated Heat Index', color='blue')
plt.plot(hi_hourly_meteo_data['date'], hi_hourly_meteo_data['apparent_temperature'],
         label='Apparent Temperature', color='red', linestyle='--')

# Add titles and labels
plt.title('Comparison of Calculated Heat Index Celsius and Apparent Temperature')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

The above exhibit does convey an overall conformity between the algorithmic model (Heat Index) and the realised data.

To now visually observe the differential concerning Heat Index and Apparent Temperature:

In [152]:
# Making a dataframe copy to avoid SettingWithCopyWarning
hix_hourly_meteo_data_new = hi_hourly_meteo_data.copy()

# Step 1: Calculate the difference between 'calculated_heat_index' and 'apparent_temperature'
# Use .loc to avoid SettingWithCopyWarning
hix_hourly_meteo_data_new.loc[:, 'heat_index_difference'] = hix_hourly_meteo_data_new['calculated_heat_index_cel'] - hix_hourly_meteo_data_new['apparent_temperature']

# Step 2: Plot the difference
plt.figure(figsize=(10, 6))
plt.plot(hix_hourly_meteo_data_new['date'],
         hix_hourly_meteo_data_new['heat_index_difference'],
         label='Difference (Heat Index - Apparent Temperature)', color='green')

# Add titles and labels
plt.title('Difference Between Calculated Heat Index and Apparent Temperature')
plt.xlabel('Date')
plt.ylabel('Temperature Difference (°C)')
plt.axhline(0, color='black', linestyle='--')  # Horizontal line at y=0 for reference
plt.grid(True)
plt.xticks(rotation=45)
plt.legend()

# Show the plot
plt.tight_layout()  # Adjust layout to make room for rotated x-axis labels
plt.show()
No description has been provided for this image

A 6°C difference in environmental temperature (not body temperature), such as a change in weather, might not be as concerning, though it could still be uncomfortable or require adjustments in clothing or activities.

Comparing the Apparent Temperature Data to the Heat Index Model and a Quantile Regression Model for Apparent Temperature¶

The Heat Index is a widely used metric that combines air temperature and humidity to determine the perceived temperature. Quantile Regression, on the other hand, is a statistical method that models the relationship between variables at different quantiles, allowing for a more nuanced understanding of the relationship between apparent temperature and its influencing factors.

By comparing these models, we can gain insights into their strengths, weaknesses, and applicability in different contexts. The Heat Index Model, while simple and widely used, may have limitations in capturing the full complexity of the relationship between temperature and humidity. Quantile Regression, with its ability to model conditional quantiles, can provide a more detailed understanding of how apparent temperature varies across different percentiles of temperature and humidity.

Furthermore, comparing the models to actual apparent temperature data can help assess their accuracy and identify potential biases. This analysis can inform decision-making in areas such as public health, urban planning, and climate change adaptation, where understanding apparent temperature is crucial.

Overall, this comparison provides valuable insights into the different approaches to modeling apparent temperature and highlights the strengths and limitations of each method. By understanding the nuances of these models, researchers and practitioners can make more informed decisions based on accurate and reliable apparent temperature estimates.

The selected features for apparent_temperature from earlier to be applied.

const -6.6007 0.006 -1143.728 0.000 -6.612 -6.589 temperature_2m 0.7305 0.000 1586.255 0.000 0.730 0.731 wind_speed_100m -0.0007 0.000 -3.469 0.001 -0.001 -0.000 wet_bulb_temperature_2m 0.7992 0.000 1764.586 0.000 0.798 0.800 wind_speed_10m -0.1465 0.000 -591.801 0.000 -0.147 -0.146 et0_fao_evapotranspiration 0.4220 0.002 237.996 0.000 0.419 0.426 boundary_layer_height 4.665e-05 2.49e-06 18.717 0.000 4.18e-05 5.15e-05

In [156]:
from statsmodels.tsa.stattools import coint

# Assuming hix_hourly_meteo_data is loaded as a DataFrame
# Step 1: Feature Engineering the Regression Model
hix_hourly_meteo_data_new['app_heat_predict_mod'] = (
    -6.6007 
    + 0.7305 * hix_hourly_meteo_data_new['temperature_2m'] 
    - 0.0007 * hix_hourly_meteo_data_new['wind_speed_100m'] 
    + 0.7992 * hix_hourly_meteo_data_new['wet_bulb_temperature_2m']
    + 0.4220 * hix_hourly_meteo_data_new['et0_fao_evapotranspiration']
    + 4.665e-05 * hix_hourly_meteo_data_new['boundary_layer_height'] 
)

# Step 2: Plotting the Time Series
plt.figure(figsize=(12, 6))
plt.plot(hix_hourly_meteo_data_new.index, 
         hix_hourly_meteo_data_new['app_heat_predict_mod'],
         label='App Heat Predict Mod')
plt.plot(hix_hourly_meteo_data_new.index,
         hix_hourly_meteo_data_new['apparent_temperature'],
         label='Apparent Temperature')
plt.plot(hix_hourly_meteo_data_new.index,
         hix_hourly_meteo_data_new['calculated_heat_index_cel'],
         label='Calculated Heat Index_cel')
plt.title('Time Series Plot')
plt.xlabel('Time')
plt.ylabel('Heat Feel Values')
plt.legend()
plt.show()

# Step 3: Plotting the Differential
hix_hourly_meteo_data_new['diff_predict_apparent'] = hix_hourly_meteo_data_new['app_heat_predict_mod'] - hix_hourly_meteo_data_new['apparent_temperature']
hix_hourly_meteo_data_new['diff_predict_calculated'] = hix_hourly_meteo_data_new['app_heat_predict_mod'] - hix_hourly_meteo_data_new['calculated_heat_index_cel']
hix_hourly_meteo_data_new['diff_apparent_calculated'] = hix_hourly_meteo_data_new['apparent_temperature'] - hix_hourly_meteo_data_new['calculated_heat_index_cel']

plt.figure(figsize=(12, 6))
plt.plot(hix_hourly_meteo_data_new.index,
         hix_hourly_meteo_data_new['diff_predict_apparent'],
         label='Predicted - Apparent')
plt.plot(hix_hourly_meteo_data_new.index,\
         hix_hourly_meteo_data_new['diff_predict_calculated'],
         label='Predicted - Calculated')
plt.plot(hix_hourly_meteo_data_new.index,
         hix_hourly_meteo_data_new['diff_apparent_calculated'],
         label='Apparent - Calculated')
plt.title('Differential Time Series Plot')
plt.xlabel('Time')
plt.ylabel('Differential Values')
plt.legend()
plt.show()

# Performing Cointegration Tests
coint_test_apparent = coint(hix_hourly_meteo_data_new['app_heat_predict_mod'],
                            hix_hourly_meteo_data_new['apparent_temperature'])
coint_test_calculated = coint(hix_hourly_meteo_data_new['app_heat_predict_mod'],
                              hix_hourly_meteo_data_new['calculated_heat_index_cel'])
coint_test_apparent_calculated = coint(hix_hourly_meteo_data_new['apparent_temperature'],
                                       hix_hourly_meteo_data_new['calculated_heat_index_cel'])

# Outputting the results
print("Cointegration Test Results:")
print("1. App Heat Predict Mod and Apparent Temperature:")
print(f"   - Test Statistic: {coint_test_apparent[0]}")
print(f"   - p-value: {coint_test_apparent[1]}")
print(f"   - Critical Values: {coint_test_apparent[2]}\n")

print("2. App Heat Predict Mod and Calculated Heat Index:")
print(f"   - Test Statistic: {coint_test_calculated[0]}")
print(f"   - p-value: {coint_test_calculated[1]}")
print(f"   - Critical Values: {coint_test_calculated[2]}\n")

print("3. Apparent Temperature and Calculated Heat Index:")
print(f"   - Test Statistic: {coint_test_apparent_calculated[0]}")
print(f"   - p-value: {coint_test_apparent_calculated[1]}")
print(f"   - Critical Values: {coint_test_apparent_calculated[2]}")
No description has been provided for this image
No description has been provided for this image
Cointegration Test Results:
1. App Heat Predict Mod and Apparent Temperature:
   - Test Statistic: -10.07281136448165
   - p-value: 1.626711589561262e-16
   - Critical Values: [-3.89686225 -3.33636556 -3.0446135 ]

2. App Heat Predict Mod and Calculated Heat Index:
   - Test Statistic: -9.903578849491844
   - p-value: 4.346699413691353e-16
   - Critical Values: [-3.89686225 -3.33636556 -3.0446135 ]

3. Apparent Temperature and Calculated Heat Index:
   - Test Statistic: -12.119515906460899
   - p-value: 2.0804774298814206e-21
   - Critical Values: [-3.89686225 -3.33636556 -3.0446135 ]

Based on the three compared raw times series, the three compared differentials time series, and the cointegration results, the quantile regression model is closer to the apparent_temperature attribute than the heat index. Additionally, for high temperatures the heat index may serve as a better gauge for extreme feel sensation; the quantile regression model for cold temperatures.

Observing Hurricanes: A Blend of Wind Speed and Pressure¶

Hurricanes, nature's most destructive forces, are categorized based on their sustained wind speeds and central atmospheric pressure. These two primary parameters provide a reliable measure of a hurricane's intensity and potential for damage.

The Saffir-Simpson Hurricane Wind Scale

The Saffir-Simpson Hurricane Wind Scale is a widely used classification system that categorizes hurricanes into five categories based on their sustained wind speeds. The higher the category, the more destructive the hurricane.

Category 1: 74-95 mph (119-153 km/h); 64-82 kt

Category 2: 96-110 mph (154-177 km/h); 83-95 kt

Category 3: 111-129 mph (178-209 km/h); 96-112 kt

Category 4: 130-156 mph (209-251 km/h); 113-136 kt

Category 5: 157 mph or higher (252 km/h or higher); 137 kt or higher

Atmospheric Pressure - A Silent Indicator

While wind speed is a visible and often dramatic indicator of a hurricane's strength, atmospheric pressure is a less obvious but equally crucial factor. As a hurricane intensifies, its central atmospheric pressure decreases. Lower pressure indicates a stronger storm, as it signifies a more powerful low-pressure system.

Visualizing Hurricane Tracks with Python and Folium Python, a versatile programming language, offers powerful libraries like Folium for creating interactive maps. By combining data on hurricane tracks, wind speeds, and atmospheric pressure, we can visualize the evolution of these storms over time.

Historical hurricane tracks data is acquired from the Climate Mapping for Resilience and Adaptation (CMRA) resource to develop geospatial projects.

VARIABLES IN THE DATA SET:

  1. SID = Storm Identifier

  2. BASIN = Basin (type or category)

  3. SUBBASIN = Subbaisin (type or category)

  4. NAME = Name (name or title)

  5. LAT = Latitude (coordinate)

  6. LON = Longitude (coordinate)

  7. USA_WIND = Maximum Sustained Wind Speed (knots) 0 - 300 kts

  8. USA_PRES = Minimum Sea Level Pressure (millibars) 850 - 1050 mb

  9. year = Year (integer)

  10. month = Month (integer)

  11. day = Day (integer)

  12. Hurricane_Date = Date (preferably to be in datetime format)

NOTE: to keep interest in visuals there will be a limited number of graphs constructed; the project is generally focused on the NYC area.

Data assimilation and cleaning:

In [159]:
import pandas as pd
import os
print(os.getcwd())

hurricane_data = pd.read_csv(r"C:\Users\verlene\Downloads\Historical_Hurricane_Tracks (1).csv")
hurricane_data.info()
C:\Users\verlene
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 696780 entries, 0 to 696779
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   OBJECTID        696780 non-null  int64  
 1   SID             696780 non-null  object 
 2   BASIN           574121 non-null  object 
 3   SUBBASIN        603264 non-null  object 
 4   NAME            696780 non-null  object 
 5   LAT             696780 non-null  float64
 6   LON             696780 non-null  float64
 7   USA_WIND        696780 non-null  int64  
 8   USA_PRES        696780 non-null  int64  
 9   year            696780 non-null  int64  
 10  month           696780 non-null  int64  
 11  day             696780 non-null  int64  
 12  Hurricane_Date  696780 non-null  object 
dtypes: float64(2), int64(6), object(5)
memory usage: 69.1+ MB
In [160]:
#Checking for all unique instances
unique_values = hurricane_data['NAME'].unique()
unique_values
Out[160]:
array(['NOT_NAMED', 'ANN', 'BETTY', ..., 'YAMANEKO', 'MANDOUG', 'DARIAN'],
      dtype=object)
In [161]:
# Dropping null entries
hurricane_data = hurricane_data.dropna()
hurricane_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 574121 entries, 0 to 696779
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   OBJECTID        574121 non-null  int64  
 1   SID             574121 non-null  object 
 2   BASIN           574121 non-null  object 
 3   SUBBASIN        574121 non-null  object 
 4   NAME            574121 non-null  object 
 5   LAT             574121 non-null  float64
 6   LON             574121 non-null  float64
 7   USA_WIND        574121 non-null  int64  
 8   USA_PRES        574121 non-null  int64  
 9   year            574121 non-null  int64  
 10  month           574121 non-null  int64  
 11  day             574121 non-null  int64  
 12  Hurricane_Date  574121 non-null  object 
dtypes: float64(2), int64(6), object(5)
memory usage: 61.3+ MB

Concerns for hurricanes from 1989 to recent.

In [163]:
modern_hurricanes_tracks = hurricane_data[hurricane_data['year'] >= 1989]
modern_hurricanes_tracks
Out[163]:
OBJECTID SID BASIN SUBBASIN NAME LAT LON USA_WIND USA_PRES year month day Hurricane_Date
474096 474097 1988364S17148 SP EA DELILAH -17.82 155.96 35 0 1989 1 1 1989/01/01 05:00:00+00
474097 474098 1988364S17148 SP EA DELILAH -17.93 156.79 40 0 1989 1 1 1989/01/01 05:00:00+00
474098 474099 1988364S17148 SP EA DELILAH -18.07 157.63 45 0 1989 1 1 1989/01/01 05:00:00+00
474099 474100 1988364S17148 SP EA DELILAH -18.19 158.54 45 0 1989 1 1 1989/01/01 05:00:00+00
474100 474101 1988364S17148 SP EA DELILAH -18.33 159.48 45 0 1989 1 1 1989/01/01 05:00:00+00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
696775 696776 2022352S12093 SI MM DARIAN -29.77 68.25 42 1000 2022 12 30 2022/12/30 05:00:00+00
696776 696777 2022352S12093 SI MM DARIAN -30.40 68.20 39 1001 2022 12 31 2022/12/31 05:00:00+00
696777 696778 2022352S12093 SI MM DARIAN -30.99 68.19 0 0 2022 12 31 2022/12/31 05:00:00+00
696778 696779 2022357S13130 SI WA ELLIE -13.30 129.80 39 994 2022 12 22 2022/12/22 05:00:00+00
696779 696780 2022357S13130 SI WA ELLIE -13.75 129.95 37 995 2022 12 22 2022/12/22 05:00:00+00

191038 rows × 13 columns

In [164]:
# Drop duplicates based on 'NAME' and 'Hurricane_Date', keeping the first occurrence
modern_hurricanes_track_unique = modern_hurricanes_tracks.drop_duplicates(subset=['NAME',
                                                                                         'Hurricane_Date'],
                                                                                 keep='first')
# Display the result
print(modern_hurricanes_track_unique)
        OBJECTID            SID BASIN SUBBASIN     NAME    LAT     LON  \
474096    474097  1988364S17148    SP       EA  DELILAH -17.82  155.96   
474104    474105  1988364S17148    SP       MM  DELILAH -19.40  163.15   
474112    474113  1988364S17148    SP       MM  DELILAH -23.25  168.72   
474120    474121  1988364S17148    SP       MM  DELILAH -28.10  170.80   
474128    474129  1988364S17148    SP       MM  DELILAH -32.10  170.50   
...          ...            ...   ...      ...      ...    ...     ...   
696752    696753  2022352S12093    SI       MM   DARIAN -19.50   79.20   
696760    696761  2022352S12093    SI       MM   DARIAN -22.40   73.70   
696768    696769  2022352S12093    SI       MM   DARIAN -26.10   70.20   
696776    696777  2022352S12093    SI       MM   DARIAN -30.40   68.20   
696778    696779  2022357S13130    SI       WA    ELLIE -13.30  129.80   

        USA_WIND  USA_PRES  year  month  day          Hurricane_Date  
474096        35         0  1989      1    1  1989/01/01 05:00:00+00  
474104        55         0  1989      1    2  1989/01/02 05:00:00+00  
474112        55         0  1989      1    3  1989/01/03 05:00:00+00  
474120        45         0  1989      1    4  1989/01/04 05:00:00+00  
474128         0         0  1989      1    5  1989/01/05 05:00:00+00  
...          ...       ...   ...    ...  ...                     ...  
696752        60       989  2022     12   28  2022/12/28 05:00:00+00  
696760        54       994  2022     12   29  2022/12/29 05:00:00+00  
696768        45      1001  2022     12   30  2022/12/30 05:00:00+00  
696776        39      1001  2022     12   31  2022/12/31 05:00:00+00  
696778        39       994  2022     12   22  2022/12/22 05:00:00+00  

[25085 rows x 13 columns]
In [165]:
modern_hurricanes_track_unique.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25085 entries, 474096 to 696778
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OBJECTID        25085 non-null  int64  
 1   SID             25085 non-null  object 
 2   BASIN           25085 non-null  object 
 3   SUBBASIN        25085 non-null  object 
 4   NAME            25085 non-null  object 
 5   LAT             25085 non-null  float64
 6   LON             25085 non-null  float64
 7   USA_WIND        25085 non-null  int64  
 8   USA_PRES        25085 non-null  int64  
 9   year            25085 non-null  int64  
 10  month           25085 non-null  int64  
 11  day             25085 non-null  int64  
 12  Hurricane_Date  25085 non-null  object 
dtypes: float64(2), int64(6), object(5)
memory usage: 2.7+ MB

K-Means Clustering: A Common Tool for Data Analysis¶

K-Means clustering is a common technique in the realm of unsupervised machine learning, a branch of artificial intelligence that delves into unlabeled data. This powerful algorithm is designed to group similar data points together, making it a versatile tool for a wide range of applications.

At its core, K-Means operates through an iterative process:

  1. Initialization: The algorithm begins by randomly selecting K data points as initial centroids, which serve as the starting points for each cluster.

  2. Assignment: Each data point is assigned to the nearest centroid, forming K distinct clusters.

  3. Update Centroids: The centroids of each cluster are recalculated as the mean of all the points assigned to that cluster.

  4. Iteration: Steps 2 and 3 are repeated until convergence, meaning the centroids no longer shift significantly.

While K-Means is a relatively simple algorithm, its effectiveness hinges on careful consideration of several factors:

  1. Choosing the Right K: Determining the optimal number of clusters (K) is a critical decision. Techniques like the Elbow Method and Silhouette Analysis can aid in this process.

  2. Initialization Sensitivity: The initial random selection of centroids can influence the final clustering results. K-Means++ is a popular technique to mitigate this issue.

  3. Outliers: Outliers can distort the clustering process. Robust K-Means algorithms and outlier detection techniques can help address this challenge.

  4. Scalability: For large datasets, K-Means can become computationally expensive. Mini-Batch K-Means is a scalable alternative that processes data in smaller batches.

Mathematical Structure¶

Given a dataset of $n$ points $X = {x_1,x_2,..,x_n}$ in $\mathbb{R}^d$, the goal is to partition $X$ into $K$ clusters such that each data point is assigned to the nearest cluster center, minimizing the sum of squared distances to the nearest centroid:

1. Centroids Initialization:

Initialize $K$ centroids ${\mu_1,\mu_2,..,\mu_K}$ randoomly from the dataset.

2. Assignment Step:

For each data point $x_i$, assign it to the nearest cluster center:

$$c_{i} = \arg\min_{j \in \{1, 2, \ldots, K\}} \left\| x_{i} - \mu_j \right\|^2$$

where $c_i$ is the cluster assignment $x_i$, and $\mu_j$ represents the centroid of cluster $j$.

3. Update Step:

After assigning each point, recompute the centroid of each cluster by taking the mean of all points assigned to it:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$

where $C_j$ represents the set of points assigned to cluster $j$, and $|C_j|$ is the number of points in cluster $j$.

4. Iterate:

Repeat the assignment and update steps until the centroids converge, which is generally achieved when there is little or no change in the positions of the centroids, or a maximum number of iterations is reached.

OBJECTIVE FUNCTION:

KMeans aims to minimize the Within-Cluster Sum of Squares (WCSS):

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$$

A Demonstration of K-Means Clustering¶

import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.cluster import KMeans from sklearn.datasets import make_blobs

Generating synthetic data with make_blobs function:

In [170]:
# Generate synthetic data
from sklearn.datasets import make_blobs
n_samples = 1000
n_clusters = 4
X, y_true = make_blobs(n_samples=n_samples, centers=n_clusters, cluster_std=0.60, random_state=42)

# Convert to DataFrame for easier manipulation
data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2'])

Visualizing the synthetic data before clustering:

In [172]:
# Generate synthetic data
n_samples = 1000
n_clusters = 4
X, y_true = make_blobs(n_samples=n_samples, centers=n_clusters, cluster_std=0.60, random_state=42)

# Convert to DataFrame for easier manipulation
data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2'])

# Check the first few rows of the generated data
print(data.head())
   Feature_1  Feature_2
0  -8.668355   7.168180
1  -6.434370  -6.700534
2  -6.544631  -6.834506
3   4.364262   1.463263
4   4.484124   1.071284

Visualizing the synthetic data:

In [174]:
# Visualize the generated data before clustering
plt.figure(figsize=(8, 6))
plt.scatter(data['Feature_1'], data['Feature_2'], s=30, color='blue', marker='o')
plt.title('Generated Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
No description has been provided for this image

Fitting a K-Means model to the data and predict the cluster for each data point:

In [176]:
# Apply K-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
data['Cluster'] = kmeans.fit_predict(X)

Visualizing the clusters:

In [178]:
plt.figure(figsize=(8, 6))
plt.scatter(data['Feature_1'], data['Feature_2'], c=data['Cluster'], s=30, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')  # Mark the centers
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
No description has been provided for this image

Now, back to the cleaned (real) historical hurricanes data:¶

In [180]:
modern_hurricanes_track_unique.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25085 entries, 474096 to 696778
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OBJECTID        25085 non-null  int64  
 1   SID             25085 non-null  object 
 2   BASIN           25085 non-null  object 
 3   SUBBASIN        25085 non-null  object 
 4   NAME            25085 non-null  object 
 5   LAT             25085 non-null  float64
 6   LON             25085 non-null  float64
 7   USA_WIND        25085 non-null  int64  
 8   USA_PRES        25085 non-null  int64  
 9   year            25085 non-null  int64  
 10  month           25085 non-null  int64  
 11  day             25085 non-null  int64  
 12  Hurricane_Date  25085 non-null  object 
dtypes: float64(2), int64(6), object(5)
memory usage: 2.7+ MB

MiniBatch K-Means¶

Mini-Batch K-Means is a variation of K-Means that addresses the scalability issue by using smaller subsets of data, called mini-batches, in each iteration. This approach significantly reduces computational cost, especially for large datasets.

Key Characteristics:

  1. Mini-Batch Processing: Processes smaller subsets of data in each iteration.

  2. Faster Convergence: Often converges faster than K-Means, especially for large datasets.

  3. Approximation: Due to the use of mini-batches, it might not converge to the same solution as K-Means, but it often provides a good approximation.

  4. Scalability: More scalable than K-Means for large datasets.

Overview

Mini-Batch K-Means Usage:

  1. Large datasets where computational efficiency is crucial.

  2. When a good approximation of the optimal clustering is sufficient.

  3. Online learning scenarios where data arrives in a stream.

Mathematical Structure¶

MiniBatch KMeans stirives to achieve clustering by updating centroids with small, randomly selected "mini-batches' of the data rather than the complete data set in each iteration.

1. MiniBatch Selection:

In each iteration a random set (mini-batch) of $m$ data points $B = {x_{i1}, x_{i2},..., x_{im}}$ is sampled from the full dataset $X$, where $m<n$.

2. Assignment Step:

For each point in the mini-batch $x_{ik}$, assign it to the nearest cluster center based on the squared Euclidean ditance:

$$c_{ik} = \arg\min_{j \in \{1, 2, \ldots, K\}} \left\| x_{ik} - \mu_j \right\|^2$$

3. Update Step:

For each cluster $j$ represented in the mini-batch, update its centroid $\mu_j$ based on the points assigned to it in the mini-batch. Using incremental mean update:

$$\mu_j = \\mu_j + \eta(x_{ik} - \mu_j)$$

where $\eta$ is the learning rate, generally computed as $\frac{1}{t_j}$, with $t_j$ as the numnber of times cluster $j$ has been updated.

4. Repeat:

Repeat the assignment and update steps until converge criteria are met, typically when centroids stabilize or a set number of iterations are reached.

Performing MiniBatch K-Means Clustering on the (real) historical hurricanes data:

The Elbow Method¶

The elbow method is a technique used to determine the optimal number of clusters (k) in K-Means clustering. It evaluates how the sum of squared distances between data points and their corresponding cluster centroids (called within-cluster sum of squares, WCSS) decreases as the number of clusters increases. The goal is to find the "elbow point," where adding more clusters does not significantly reduce the WCSS.

In K-Means clustering, inertia refers to the within-cluster sum of squares (WCSS). It measures how well the clustering algorithm has grouped the data points within their respective clusters. Specifically, it quantifies how close the data points are to their cluster's centroid.

Definition of Inertia:

Inertia is calculated as the sum of the squared distances between each data point and the centroid of the cluster it belongs to. Mathematically:

$$\text{Inertia} = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2$$

Where:

$k$ is the number of clusters;

$C_i$ is the set of cluster points in cluster $i$;

$x$ is a data point;

$\mu$ is the centroid of the cluster $i$.

Interpretation

  1. Low inertia means that the points are close to their centroids, indicating good clustering.

  2. High inertia means that the points are farther from their centroids, which may indicate that the clustering is not well-fitted.

Inertia decreases when increasing the number of clusters, because the data points are split into smaller groups. However, adding more clusters reduces the inertia at diminishing returns. This is why there's use of the elbow method to balance the trade-off between inertia reduction and model simplicity.

The elbow method involves plotting the inertia (WCSS) against the number of clusters (k). As the number of clusters increases, inertia will decrease. However, after a certain point (the elbow), the reduction in inertia becomes negligible, indicating the optimal number of clusters.

For the Elbow Method one should observe where the drastic drop in the curve falters.

Silhouette Score¶

The silhouette score is a metric used to evaluate the quality of clusters formed by a clustering algorithm like K-Means. It measures how similar an object is to its own cluster compared to other clusters. The silhouette score can range from -1 to 1, where:

A score close to 1 indicates that the object is well-clustered and is close to its own cluster center while being far away from other clusters.

A score close to 0 indicates that the object is on or very close to the decision boundary between two neighboring clusters.

A score close to -1 indicates that the object may have been assigned to the wrong cluoptimal.

ong clu For a single data point $i$, the silhouette score $s(i)$ is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

Where:

$a(i)$ is the average intra-cluster distance for point $i$;

$b(i)$ is the average inter-cluster distance for p. $i$.

Explanation of the Code

  1. Data Loading and Preparation:

The relevant columns are selected from the dataset, and any missing values are dropped.

  1. Standardization:

Standardizing the data helps K-means perform better by giving each feature equal weight.

  1. Elbow Method and Silhouette Score:

The Elbow Method is used to determine the optimal number of clusters by plotting inertia (the sum of squared distances from each point to its assigned cluster center).

The Silhouette Score provides insight into how well the clusters are defined.

  1. K-means Clustering:

After determining the optimal number of clusters, K-means is fitted to the scaled data, and clusters are assigned.

  1. Visualization:

Clusters are visualized using latitude and longitude for spatial analysis.

  1. Cluster Analysis:

The mean values for each feature in each cluster are calculated to analyze cluster characteristics.

The outputs observed are the centroids (or means) of clusters generated from a K-Means clustering analysis on a dataset that includes variables related to hurricanes, particularly USA_WIND, USA_PRES, LAT, and LON.

Comprehending the data:

  1. Cluster Index: The first column (the index) labeled Cluster represents the different clusters that K-Means has identified in the dataset. Each number (0, 1, 2, 3) corresponds to a different cluster.

  2. Variables:

USA_WIND: This represents the wind speed associated with the hurricanes (measured in miles per hour or another unit). Higher values indicate stronger winds.

USA_PRES: This indicates the atmospheric pressure (measured in millibars or inches of mercury) associated with the hurricanes. Lower pressure is typically associated with more intense storms.

LAT (Latitude): This indicates the latitude where the hurricanes occurred. Latitude values range from -90 (South Pole) to +90 (North Pole).

LON (Longitude): This indicates the longitude of the hurricane's location, with values ranging from -180 to +180 degrees.

In [184]:
import os
os.environ["OMP_NUM_THREADS"] = "2"

from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import folium
from folium.plugins import MarkerCluster
data = modern_hurricanes_track_unique[['USA_WIND', 'USA_PRES', 'LAT', 'LON']].dropna()

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Determine the optimal number of clusters using the Elbow Method
inertia = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    mb_kmeans = MiniBatchKMeans(n_clusters=k, random_state=42, batch_size=1024)  # Increase batch size
    mb_kmeans.fit(scaled_data)
    inertia.append(mb_kmeans.inertia_)
    silhouette_scores.append(silhouette_score(scaled_data, mb_kmeans.labels_))

# Plot the Elbow Curve
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(K_range, inertia, marker='o')
plt.title('Elbow Method for Mini-Batch K-Means')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')

# Plot the Silhouette Scores
plt.subplot(1, 2, 2)
plt.plot(K_range, silhouette_scores, marker='o')
plt.title('Silhouette Scores for Mini-Batch K-Means')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()

# Choose the number of clusters based on the plots
optimal_k = 6 
mb_kmeans = MiniBatchKMeans(n_clusters=optimal_k, random_state=42, batch_size=1024)  # Increase batch size
data['Cluster'] = mb_kmeans.fit_predict(scaled_data)

# Calculate bounds for each cluster
bounds = data.groupby('Cluster').agg(
    wind_speed_min=('USA_WIND', 'min'),
    wind_speed_max=('USA_WIND', 'max'),
    pressure_min=('USA_PRES', 'min'),
    pressure_max=('USA_PRES', 'max')
).reset_index()

# Display the bounds
print(f"Bounds for Each Cluster:")
print(bounds)

# Count occurrences in 'Cluster'
cluster_counts = data['Cluster'].value_counts().reset_index()
cluster_counts.columns = ['Cluster', 'Counts']  # Rename columns for clarity
print(cluster_counts)

# Visualize the clusters
plt.figure(figsize=(10, 8))
plt.scatter(data['LON'], data['LAT'], c=data['Cluster'], cmap='viridis', alpha=0.5)
plt.title('Storm Clusters Based on Latitude and Longitude (Mini-Batch K-Means)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar(label='Cluster')
plt.show()

# Analyze the characteristics of each cluster
cluster_analysis = data.groupby('Cluster').mean()
print(f"Cluster Analysis after EM and SS:")
print(cluster_analysis)

# Explore how homogeneous each cluster is.
cluster_variances = data.groupby('Cluster').var()
print(f"Variances of Clusters after EM and SS:")
print(cluster_variances)


cluster_counts = data.groupby("Cluster").size()
print(f"Cluster Counts After EM and SS:")
print(cluster_counts)


# For a specific feature, plot its distribution across clusters
for column in data.columns:  
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='Cluster', y=column, data=data)
    plt.title(f'Distribution of {column} across clusters')
    plt.show()


from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# Train a decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(data, mb_kmeans.labels_)
# Plot the tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=data.columns, filled=True)
plt.show()

# Get feature importances
feature_importances = clf.feature_importances_

# Create a DataFrame to show feature names and their importance
feature_importance_df = pd.DataFrame({
    'Feature': data.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Print feature importance
print(feature_importance_df)

# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importances for Predicting Cluster Labels")
plt.gca().invert_yaxis()
plt.show()


from sklearn.ensemble import RandomForestClassifier
# Train a random forest classifier
rf = RandomForestClassifier()
rf.fit(data, mb_kmeans.labels_)

# Get feature importances
importances = rf.feature_importances_
sorted_indices = np.argsort(importances)[::-1]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importance (Random Forest)")
plt.bar(range(data.shape[1]), importances[sorted_indices], align="center")
plt.xticks(range(data.shape[1]), np.array(data.columns)[sorted_indices], rotation=90)
plt.tight_layout()
plt.show()
No description has been provided for this image
Bounds for Each Cluster:
   Cluster  wind_speed_min  wind_speed_max  pressure_min  pressure_max
0        0              30             150             0             0
1        1              10              70           966          1014
2        2               0             120             0          1021
3        3               0              55             0             0
4        4               0             100             0          1014
5        5              70             165             0           990
   Cluster  Counts
0        3    8438
1        2    4814
2        1    4261
3        4    3288
4        0    2541
5        5    1743
No description has been provided for this image
Cluster Analysis after EM and SS:
           USA_WIND    USA_PRES        LAT         LON
Cluster                                               
0         75.591893    0.000000   9.133184  124.719646
1         34.062661  997.228820  17.420854  128.443602
2         41.525343  900.859576  13.580511 -123.049342
3         15.509125    0.000000   1.858607  118.171410
4         39.358273  991.329075 -16.527248  101.829419
5        103.876649  941.353414  12.191733   97.574819
Variances of Clusters after EM and SS:
           USA_WIND      USA_PRES         LAT          LON
Cluster                                                   
0        659.093619      0.000000  378.297264  1251.944570
1        210.182223     94.176033   66.615281   920.150782
2        463.094620  88182.055490  123.553808   433.829394
3        223.487590      0.000000  463.165816  2287.627721
4        331.382402   2555.716441   34.569747  2022.387449
5        423.557107   4423.740699  234.885437  6802.055453
Cluster Counts After EM and SS:
Cluster
0    2541
1    4261
2    4814
3    8438
4    3288
5    1743
dtype: int64
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
    Feature  Importance
4   Cluster    0.735529
1  USA_PRES    0.264067
2       LAT    0.000404
0  USA_WIND    0.000000
3       LON    0.000000
No description has been provided for this image
No description has been provided for this image

Counts for Hurricane Categories¶

In [186]:
import pandas as pd

# Define a function to categorize based on the Saffir-Simpson Hurricane Wind Scale (in knots)
def categorize_hurricane(wind_speed):
    if 64 <= wind_speed <= 82:
        return 'Category 1'
    elif 83 <= wind_speed <= 95:
        return 'Category 2'
    elif 96 <= wind_speed <= 112:
        return 'Category 3'
    elif 113 <= wind_speed <= 136:
        return 'Category 4'
    elif wind_speed >= 137:
        return 'Category 5'
    else:
        return 'Tropical Storm or Lower'


modern_hurricanes_track_unique = modern_hurricanes_track_unique.copy()

# Apply the categorization function using .loc
modern_hurricanes_track_unique['Category'] = modern_hurricanes_track_unique.loc[:, 'USA_WIND'].apply(categorize_hurricane)

# Group by year and category to count occurrences
category_counts = modern_hurricanes_track_unique.groupby(['year', 'Category']).size().unstack(fill_value=0)

# Display the counts
print(category_counts)
Category  Category 1  Category 2  Category 3  Category 4  Category 5  \
year                                                                   
1989              88          30          25          24           3   
1990              88          48          29          26           3   
1991              70          44          35          27           4   
1992              97          59          34          45           5   
1993              66          44          19          21           0   
1994              82          41          36          40           6   
1995              51          26          22          14           6   
1996              77          39          21          30           4   
1997              93          40          30          33          21   
1998              61          37          14          17           5   
1999              37          23          17          15           1   
2000              67          26          18          15           2   
2001              70          33          19          13           2   
2002              55          40          22          36           7   
2003              60          28          20          29           3   
2004              62          26          35          35           8   
2005              51          30          34          22           5   
2006              58          30          22          28           6   
2007              52          18          14          18           2   
2008              49          23          16          12           0   
2009              41          22          19          20           7   
2010              33          13          16           8           4   
2011              44          24          15          15           1   
2012              48          31          24          21           2   
2013              56          30          10          17           6   
2014              55          29          22          19           9   
2015              73          47          41          39           8   
2016              59          36          18          18           8   
2017              41          28           5           9           0   
2018              63          42          41          36          12   
2019              66          36          22          27           3   
2020              38          16          10          13           2   
2021              36          17          12          18           5   
2022              38          19          15          11           2   

Category  Tropical Storm or Lower  
year                               
1989                          682  
1990                          728  
1991                          634  
1992                          758  
1993                          666  
1994                          805  
1995                          560  
1996                          779  
1997                          829  
1998                          555  
1999                          508  
2000                          606  
2001                          504  
2002                          510  
2003                          615  
2004                          568  
2005                          523  
2006                          546  
2007                          513  
2008                          574  
2009                          629  
2010                          412  
2011                          500  
2012                          544  
2013                          583  
2014                          610  
2015                          663  
2016                          509  
2017                          515  
2018                          675  
2019                          641  
2020                          553  
2021                          609  
2022                          394  
In [187]:
unique_names = sorted(modern_hurricanes_track_unique['BASIN'].unique())
print(unique_names)
['EP', 'NI', 'SA', 'SI', 'SP', 'WP']

Interestingly, no observation of cases to have a decent sample set for the atltantic.

Mann-Whitney Test for the Different Observed Basins¶

Concerns identifying any difference in basins for storms for years 1989 to 2022.

In [190]:
import pandas as pd
from scipy.stats import mannwhitneyu
import itertools

# Unique BASIN instances
basins = ['EP', 'NI', 'SA', 'SI', 'SP', 'WP']

# Function to perform Mann-Whitney U Test
def mann_whitney_test(df, column, basin1, basin2):
    # Filter the data for each basin
    data1 = modern_hurricanes_track_unique[modern_hurricanes_track_unique['BASIN'] == basin1][column]
    data2 = modern_hurricanes_track_unique[modern_hurricanes_track_unique['BASIN'] == basin2][column]
    
    # Check if both groups have data
    if len(data1) > 0 and len(data2) > 0:
        # Perform the Mann-Whitney U test
        stat, p_value = mannwhitneyu(data1, data2, alternative='two-sided')
        mean_diff = data1.mean() - data2.mean()
        median_diff = data1.median() - data2.median()
        greater_mean = basin1 if mean_diff > 0 else basin2
        greater_median = basin1 if median_diff > 0 else basin2
        return stat, p_value, mean_diff, median_diff, greater_mean, greater_median
    else:
        # Return None for each expected output if one of the groups is empty
        return None, None, None, None, None, None

# Store results in lists and create DataFrames later
results_wind = []
results_pres = []

# Perform Mann-Whitney U Test for each combination of BASIN pairs
for basin1, basin2 in itertools.combinations(basins, 2):
    # Test for USA_WIND
    result_wind = mann_whitney_test(modern_hurricanes_track_unique, 'USA_WIND', basin1, basin2)
    if result_wind[0] is not None:  # Check if the test was valid (not None)
        stat_wind, p_value_wind, mean_diff_wind, median_diff_wind, greater_mean_wind, greater_median_wind = result_wind
        results_wind.append({
            'BASIN1': basin1,
            'BASIN2': basin2,
            'Statistic': stat_wind,
            'P_Value': p_value_wind,
            'Mean_Difference': mean_diff_wind,
            'Median_Difference': median_diff_wind,
            'Greater_Mean': greater_mean_wind,
            'Greater_Median': greater_median_wind
        })

    # Test for USA_PRES
    result_pres = mann_whitney_test(modern_hurricanes_track_unique, 'USA_PRES', basin1, basin2)
    if result_pres[0] is not None:  # Check if the test was valid (not None)
        stat_pres, p_value_pres, mean_diff_pres, median_diff_pres, greater_mean_pres, greater_median_pres = result_pres
        results_pres.append({
            'BASIN1': basin1,
            'BASIN2': basin2,
            'Statistic': stat_pres,
            'P_Value': p_value_pres,
            'Mean_Difference': mean_diff_pres,
            'Median_Difference': median_diff_pres,
            'Greater_Mean': greater_mean_pres,
            'Greater_Median': greater_median_pres
        })

# Convert lists to DataFrames
results_wind_df = pd.DataFrame(results_wind)
results_pres_df = pd.DataFrame(results_pres)

# Display results
print("Mann-Whitney U Test Results for USA_WIND:")
print(results_wind_df)

print("\nMann-Whitney U Test Results for USA_PRES:")
print(results_pres_df)
Mann-Whitney U Test Results for USA_WIND:
   BASIN1 BASIN2   Statistic       P_Value  Mean_Difference  \
0      EP     NI   4766626.5  1.076687e-87        14.322577   
1      EP     SA     39912.0  1.076877e-01        13.984961   
2      EP     SI  17981544.5  1.722204e-92         9.478192   
3      EP     SP   8876503.0  3.805687e-64         9.685972   
4      EP     WP  27700667.0  8.222761e-57         3.853414   
5      NI     SA      8079.5  3.614357e-01        -0.337616   
6      NI     SI   3991975.0  3.357271e-07        -4.844385   
7      NI     SP   1984075.5  1.668213e-05        -4.636605   
8      NI     WP   5998358.0  5.722978e-22       -10.469163   
9      SA     SI     39799.5  8.975860e-01        -4.506769   
10     SA     SP     19586.5  9.117688e-01        -4.298989   
11     SA     WP     61049.0  8.105968e-01       -10.131547   
12     SI     SP   8928089.5  6.932966e-01         0.207780   
13     SI     WP  27171336.5  1.386588e-14        -5.624778   
14     SP     WP  13317750.0  6.151791e-11        -5.832558   

    Median_Difference Greater_Mean Greater_Median  
0                10.0           EP             EP  
1                 5.0           EP             EP  
2                 5.0           EP             EP  
3                 5.0           EP             EP  
4                 5.0           EP             EP  
5                -5.0           SA             SA  
6                -5.0           SI             SI  
7                -5.0           SP             SP  
8                -5.0           WP             WP  
9                 0.0           SI             SI  
10                0.0           SP             SP  
11                0.0           WP             WP  
12                0.0           SI             SP  
13                0.0           WP             WP  
14                0.0           WP             WP  

Mann-Whitney U Test Results for USA_PRES:
   BASIN1 BASIN2   Statistic        P_Value  Mean_Difference  \
0      EP     NI   5455091.5  1.859533e-211       420.903597   
1      EP     SA     26792.0   3.266775e-01       -79.629684   
2      EP     SI  23872854.5   0.000000e+00       487.792406   
3      EP     SP  12098287.0   0.000000e+00       536.578276   
4      EP     WP  38617904.0   0.000000e+00       466.059056   
5      NI     SA      2789.5   3.071658e-06      -500.533280   
6      NI     SI   4785579.5   4.241358e-10        66.888809   
7      NI     SP   2476323.5   3.001844e-19       115.674679   
8      NI     WP   7699261.5   2.254912e-08        45.155459   
9      SA     SI     69987.5   4.917170e-08       567.422090   
10     SA     SP     35237.5   4.363927e-09       616.207960   
11     SA     WP    112926.0   1.275861e-07       545.688740   
12     SI     SP   9377982.0   1.774320e-06        48.785870   
13     SI     WP  28929909.0   1.473481e-01       -21.733350   
14     SP     WP  13458892.5   3.829466e-10       -70.519220   

    Median_Difference Greater_Mean Greater_Median  
0                74.0           EP             EP  
1                 0.0           SA             SA  
2              1003.0           EP             EP  
3              1003.0           EP             EP  
4              1003.0           EP             EP  
5               -74.0           SA             SA  
6               929.0           NI             NI  
7               929.0           NI             NI  
8               929.0           NI             NI  
9              1003.0           SA             SA  
10             1003.0           SA             SA  
11             1003.0           SA             SA  
12                0.0           SI             SP  
13                0.0           WP             WP  
14                0.0           WP             WP  

For the print outs of the Greater_Mean and Greater_Median such are just the declaration which basin has the larger value. Say, for USA_WIND, the case of EP versus NI, both the mean and median for EP are larger than those for NI, so EP shows in the first row in both cases for print out. Likewise for USA_PRES.

Hurricane Analysis for the Atlantic Basin¶

After numerous searches for access to hurricane history for the Atlantic basin, data was acquired from a Kaggle repository. However, such data history terminates at year 2015. Such is at least a 9 year gap in modern data. However, for the case of New York, such is still meaningful, since the last hurricane (remnant) to influence New York was IDA.

The National Hurricane Center (NHC) conducts a post-storm analysis of each tropical cyclone in the Atlantic basin (i.e., North Atlantic Ocean, Gulf of Mexico, and Caribbean Sea) and and the North Pacific Ocean to determine the official assessment of the cyclone's history. This analysis makes use of all available observations, including those that may not have been available in real time. In addition, NHC conducts ongoing reviews of any retrospective tropical cyclone analyses brought to its attention and on a regular basis updates the historical record to reflect changes introduced.

To now commence with the data assimilation, data wrangling and feature engineering:

In [193]:
import numpy as np
import pandas as pd

import kagglehub

# Download the dataset
path = kagglehub.dataset_download("noaa/hurricane-database")
print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version, please consider updating (latest version: 0.3.12)
Path to dataset files: C:\Users\verlene\.cache\kagglehub\datasets\noaa\hurricane-database\versions\1
In [194]:
import pandas as pd
import os

# Specify the directory path
dir_path = r"C:\Users\verlene\.cache\kagglehub\datasets\noaa\hurricane-database\versions\1"

# Load 'atlantic.csv' or 'pacific.csv' into a pandas DataFrame
atlantic_path = os.path.join(dir_path, "atlantic.csv")

# Load the Atlantic data
df_atlantic = pd.read_csv(atlantic_path)
print("Atlantic Data:")
print(df_atlantic.head())
Atlantic Data:
         ID                 Name      Date  Time Event Status Latitude  \
0  AL011851              UNNAMED  18510625     0           HU    28.0N   
1  AL011851              UNNAMED  18510625   600           HU    28.0N   
2  AL011851              UNNAMED  18510625  1200           HU    28.0N   
3  AL011851              UNNAMED  18510625  1800           HU    28.1N   
4  AL011851              UNNAMED  18510625  2100     L     HU    28.2N   

  Longitude  Maximum Wind  Minimum Pressure  ...  Low Wind SW  Low Wind NW  \
0     94.8W            80              -999  ...         -999         -999   
1     95.4W            80              -999  ...         -999         -999   
2     96.0W            80              -999  ...         -999         -999   
3     96.5W            80              -999  ...         -999         -999   
4     96.8W            80              -999  ...         -999         -999   

   Moderate Wind NE  Moderate Wind SE  Moderate Wind SW  Moderate Wind NW  \
0              -999              -999              -999              -999   
1              -999              -999              -999              -999   
2              -999              -999              -999              -999   
3              -999              -999              -999              -999   
4              -999              -999              -999              -999   

   High Wind NE  High Wind SE  High Wind SW  High Wind NW  
0          -999          -999          -999          -999  
1          -999          -999          -999          -999  
2          -999          -999          -999          -999  
3          -999          -999          -999          -999  
4          -999          -999          -999          -999  

[5 rows x 22 columns]
In [195]:
# Drop rows where 'Minimum Pressure' equals -999
df_atlantic = df_atlantic[df_atlantic['Minimum Pressure'] != -999]

# Check the updated data
print("Atlantic Data (after dropping -999 values):")
print(df_atlantic)
Atlantic Data (after dropping -999 values):
             ID                 Name      Date  Time Event Status Latitude  \
127    AL011852              UNNAMED  18520826   600     L     HU    30.2N   
252    AL031853              UNNAMED  18530903  1200           HU    19.7N   
346    AL031854              UNNAMED  18540907  1200           HU    28.0N   
351    AL031854              UNNAMED  18540908  1800           HU    31.6N   
352    AL031854              UNNAMED  18540908  2000     L     HU    31.7N   
...         ...                  ...       ...   ...   ...    ...      ...   
49100  AL122015                 KATE  20151112  1200           EX    41.3N   
49101  AL122015                 KATE  20151112  1800           EX    41.9N   
49102  AL122015                 KATE  20151113     0           EX    41.5N   
49103  AL122015                 KATE  20151113   600           EX    40.8N   
49104  AL122015                 KATE  20151113  1200           EX    40.7N   

      Longitude  Maximum Wind  Minimum Pressure  ...  Low Wind SW  \
127       88.6W           100               961  ...         -999   
252       56.2W           130               924  ...         -999   
346       78.6W           110               938  ...         -999   
351       81.1W           100               950  ...         -999   
352       81.1W           100               950  ...         -999   
...         ...           ...               ...  ...          ...   
49100     50.4W            55               981  ...          180   
49101     49.9W            55               983  ...          180   
49102     49.2W            50               985  ...          200   
49103     47.5W            45               985  ...          180   
49104     45.4W            45               987  ...          150   

       Low Wind NW  Moderate Wind NE  Moderate Wind SE  Moderate Wind SW  \
127           -999              -999              -999              -999   
252           -999              -999              -999              -999   
346           -999              -999              -999              -999   
351           -999              -999              -999              -999   
352           -999              -999              -999              -999   
...            ...               ...               ...               ...   
49100          120               120               120                60   
49101          120               120               120                60   
49102          220               120               120                60   
49103          220                 0                 0                 0   
49104          220                 0                 0                 0   

       Moderate Wind NW  High Wind NE  High Wind SE  High Wind SW  \
127                -999          -999          -999          -999   
252                -999          -999          -999          -999   
346                -999          -999          -999          -999   
351                -999          -999          -999          -999   
352                -999          -999          -999          -999   
...                 ...           ...           ...           ...   
49100                 0             0             0             0   
49101                 0             0             0             0   
49102                 0             0             0             0   
49103                 0             0             0             0   
49104                 0             0             0             0   

       High Wind NW  
127            -999  
252            -999  
346            -999  
351            -999  
352            -999  
...             ...  
49100             0  
49101             0  
49102             0  
49103             0  
49104             0  

[18436 rows x 22 columns]
In [196]:
df_atlantic = df_atlantic.dropna()
In [197]:
unique_values = df_atlantic['Name'].unique()
unique_values
Out[197]:
array(['            UNNAMED', '               ABLE',
       '              BAKER', '            CHARLIE',
       '                DOG', '               EASY',
       '                FOX', '             GEORGE',
       '               ITEM', '               KING',
       '               LOVE', '                HOW',
       '                JIG', '              ALICE',
       '            BARBARA', '              CAROL',
       '              DOLLY', '               EDNA',
       '           FLORENCE', '               GAIL',
       '              HAZEL', '              GILDA',
       '             CONNIE', '              DIANE',
       '              EDITH', '              FLORA',
       '             GLADYS', '               IONE',
       '              HILDA', '              JANET',
       '              KATIE', '               ANNA',
       '              BETSY', '              CARLA',
       '               DORA', '              ETHEL',
       '             FLOSSY', '              GRETA',
       '             AUDREY', '             BERTHA',
       '             CARRIE', '             DEBBIE',
       '             ESTHER', '             FRIEDA',
       '              BECKY', '               CLEO',
       '              DAISY', '               ELLA',
       '               FIFI', '              GERDA',
       '             HELENE', '               ILSA',
       '             JANICE', '             ARLENE',
       '             BEULAH', '              CINDY',
       '              DEBRA', '             GRACIE',
       '             HANNAH', '              IRENE',
       '             JUDITH', '               ABBY',
       '             BRENDA', '              DONNA',
       '            FRANCES', '             HATTIE',
       '              JENNY', '               INGA',
       '               ALMA', '              CELIA',
       '              GINNY', '             HELENA',
       '             ISBELL', '              ELENA',
       '            DOROTHY', '              FAITH',
       '             HALLIE', '               INEZ',
       '               LOIS', '              CHLOE',
       '              DORIA', '               FERN',
       '             GINGER', '              HEIDI',
       '              CANDY', '            BLANCHE',
       '            CAMILLE', '                EVE',
       '          FRANCELIA', '              HOLLY',
       '               KARA', '             LAURIE',
       '             MARTHA', '             FELICE',
       '               BETH', '             KRISTY',
       '              LAURA', '              ALPHA',
       '              AGNES', '              BETTY',
       '               DAWN', '              DELTA',
       '               ALFA', '          CHRISTINE',
       '              DELIA', '              ELLEN',
       '               FRAN', '             CARMEN',
       '             ELAINE', '           GERTRUDE',
       '                AMY', '           CAROLINE',
       '              DORIS', '             ELOISE',
       '               FAYE', '              BELLE',
       '             DOTTIE', '            CANDICE',
       '               EMMY', '             GLORIA',
       '              ANITA', '               BABE',
       '              CLARA', '             EVELYN',
       '             AMELIA', '               BESS',
       '               CORA', '            FLOSSIE',
       '               HOPE', '               IRMA',
       '             JULIET', '             KENDRA',
       '                ANA', '                BOB',
       '          CLAUDETTE', '              DAVID',
       '           FREDERIC', '              HENRI',
       '              ALLEN', '             BONNIE',
       '            CHARLEY', '            GEORGES',
       '               EARL', '           DANIELLE',
       '            HERMINE', '               IVAN',
       '             JEANNE', '               KARL',
       '               BRET', '             DENNIS',
       '              EMILY', '              FLOYD',
       '               GERT', '             HARVEY',
       '               JOSE', '            KATRINA',
       '            ALBERTO', '              BERYL',
       '              CHRIS', '              DEBBY',
       '            ERNESTO', '             ALICIA',
       '              BARRY', '            CHANTAL',
       '               DEAN', '             ARTHUR',
       '              CESAR', '              DIANA',
       '            EDOUARD', '             GUSTAV',
       '           HORTENSE', '            ISIDORE',
       '          JOSEPHINE', '              KLAUS',
       '               LILI', '              DANNY',
       '             FABIAN', '             ISABEL',
       '               JUAN', '               KATE',
       '             ANDREW', '            GILBERT',
       '              ISAAC', '               JOAN',
       '              KEITH', '            ALLISON',
       '               ERIN', '              FELIX',
       '          GABRIELLE', '               HUGO',
       '               IRIS', '              JERRY',
       '              KAREN', '              MARCO',
       '               NANA', '              ERIKA',
       '              GRACE', '             GORDON',
       '           HUMBERTO', '               LUIS',
       '            MARILYN', '               NOEL',
       '               OPAL', '              PABLO',
       '            ROXANNE', '          SEBASTIEN',
       '              TANYA', '               KYLE',
       '               BILL', '               ALEX',
       '               LISA', '              MITCH',
       '             NICOLE', '              LENNY',
       '              JOYCE', '             LESLIE',
       '            MICHAEL', '             NADINE',
       '            LORENZO', '           MICHELLE',
       '               OLGA', '          CRISTOBAL',
       '                FAY', '              HANNA',
       '              LARRY', '              MINDY',
       '           NICHOLAS', '             ODETTE',
       '              PETER', '             GASTON',
       '            MATTHEW', '               OTTO',
       '           FRANKLIN', '                TEN',
       '                LEE', '              MARIA',
       '               NATE', '            OPHELIA',
       '           PHILIPPE', '               RITA',
       '           NINETEEN', '               STAN',
       '              TAMMY', '         TWENTY-TWO',
       '              VINCE', '              WILMA',
       '               BETA', '              GAMMA',
       '            EPSILON', '               ZETA',
       '             ANDREA', '             INGRID',
       '            MELISSA', '            FIFTEEN',
       '                IKE', '               OMAR',
       '            SIXTEEN', '             PALOMA',
       '                ONE', '               FRED',
       '              EIGHT', '                IDA',
       '                TWO', '              COLIN',
       '               FIVE', '              FIONA',
       '               IGOR', '              JULIA',
       '              PAULA', '            RICHARD',
       '              SHARY', '              TOMAS',
       '                DON', '              KATIA',
       '               RINA', '               SEAN',
       '               KIRK', '              OSCAR',
       '              PATTY', '             RAFAEL',
       '              SANDY', '               TONY',
       '             DORIAN', '            FERNAND',
       '            GONZALO', '               NINE',
       '            JOAQUIN'], dtype=object)
In [198]:
# Strip leading and trailing whitespace from column names
df_atlantic.columns = df_atlantic.columns.str.strip()
In [199]:
df_atlantic.info()
<class 'pandas.core.frame.DataFrame'>
Index: 18436 entries, 127 to 49104
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ID                18436 non-null  object
 1   Name              18436 non-null  object
 2   Date              18436 non-null  int64 
 3   Time              18436 non-null  int64 
 4   Event             18436 non-null  object
 5   Status            18436 non-null  object
 6   Latitude          18436 non-null  object
 7   Longitude         18436 non-null  object
 8   Maximum Wind      18436 non-null  int64 
 9   Minimum Pressure  18436 non-null  int64 
 10  Low Wind NE       18436 non-null  int64 
 11  Low Wind SE       18436 non-null  int64 
 12  Low Wind SW       18436 non-null  int64 
 13  Low Wind NW       18436 non-null  int64 
 14  Moderate Wind NE  18436 non-null  int64 
 15  Moderate Wind SE  18436 non-null  int64 
 16  Moderate Wind SW  18436 non-null  int64 
 17  Moderate Wind NW  18436 non-null  int64 
 18  High Wind NE      18436 non-null  int64 
 19  High Wind SE      18436 non-null  int64 
 20  High Wind SW      18436 non-null  int64 
 21  High Wind NW      18436 non-null  int64 
dtypes: int64(16), object(6)
memory usage: 3.2+ MB
In [200]:
df_atlantic = df_atlantic.copy()

# Ensure 'Date' column is in string format
df_atlantic['Date'] = df_atlantic['Date'].astype(str)

# Convert 'Time' to string and format it correctly
df_atlantic['Time'] = df_atlantic['Time'].astype(str).str.zfill(4)  # Ensure 4 digits

# Combine 'Date' and 'Time' columns into a single datetime column
df_atlantic['DateTime'] = pd.to_datetime(df_atlantic['Date'] + ' ' + df_atlantic['Time'].str[:2] + ':' + df_atlantic['Time'].str[2:])

# Drop the 'Date' and 'Time' columns from the Atlantic DataFrame
df_atlantic = df_atlantic.drop(columns=['Date', 'Time'])

# Remove any non-numeric characters (except for digits, '.' and '-') from Latitude and Longitude columns
df_atlantic['Latitude'] = df_atlantic['Latitude'].str.replace(r'[^\d.-]', '', regex=True)
df_atlantic['Longitude'] = df_atlantic['Longitude'].str.replace(r'[^\d.-]', '', regex=True)

# Convert the Longitude column to numeric, coercing errors to NaN
df_atlantic['Longitude'] = pd.to_numeric(df_atlantic['Longitude'], errors='coerce')

# Convert the Latitude column to numeric, coercing errors to NaN
df_atlantic['Latitude'] = pd.to_numeric(df_atlantic['Latitude'], errors='coerce')

# Update the Longitude column to be negative for positive values
df_atlantic['Longitude'] = df_atlantic['Longitude'].apply(lambda x: -abs(x) if x > 0 else x)


df_atlantic.info()
<class 'pandas.core.frame.DataFrame'>
Index: 18436 entries, 127 to 49104
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   ID                18436 non-null  object        
 1   Name              18436 non-null  object        
 2   Event             18436 non-null  object        
 3   Status            18436 non-null  object        
 4   Latitude          18436 non-null  float64       
 5   Longitude         18436 non-null  float64       
 6   Maximum Wind      18436 non-null  int64         
 7   Minimum Pressure  18436 non-null  int64         
 8   Low Wind NE       18436 non-null  int64         
 9   Low Wind SE       18436 non-null  int64         
 10  Low Wind SW       18436 non-null  int64         
 11  Low Wind NW       18436 non-null  int64         
 12  Moderate Wind NE  18436 non-null  int64         
 13  Moderate Wind SE  18436 non-null  int64         
 14  Moderate Wind SW  18436 non-null  int64         
 15  Moderate Wind NW  18436 non-null  int64         
 16  High Wind NE      18436 non-null  int64         
 17  High Wind SE      18436 non-null  int64         
 18  High Wind SW      18436 non-null  int64         
 19  High Wind NW      18436 non-null  int64         
 20  DateTime          18436 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(14), object(4)
memory usage: 3.1+ MB
In [201]:
# Extract the year from the DateTime column
df_atlantic['year'] = df_atlantic['DateTime'].dt.year
df_atlantic['Month'] = df_atlantic['DateTime'].dt.month
df_atlantic['Day'] = df_atlantic['DateTime'].dt.day

# Filter modern hurricanes from 1980 onwards
modern_hurricanes_tracks = df_atlantic[df_atlantic['year'] >= 1980]

modern_hurricanes_tracks.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14593 entries, 33704 to 49104
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   ID                14593 non-null  object        
 1   Name              14593 non-null  object        
 2   Event             14593 non-null  object        
 3   Status            14593 non-null  object        
 4   Latitude          14593 non-null  float64       
 5   Longitude         14593 non-null  float64       
 6   Maximum Wind      14593 non-null  int64         
 7   Minimum Pressure  14593 non-null  int64         
 8   Low Wind NE       14593 non-null  int64         
 9   Low Wind SE       14593 non-null  int64         
 10  Low Wind SW       14593 non-null  int64         
 11  Low Wind NW       14593 non-null  int64         
 12  Moderate Wind NE  14593 non-null  int64         
 13  Moderate Wind SE  14593 non-null  int64         
 14  Moderate Wind SW  14593 non-null  int64         
 15  Moderate Wind NW  14593 non-null  int64         
 16  High Wind NE      14593 non-null  int64         
 17  High Wind SE      14593 non-null  int64         
 18  High Wind SW      14593 non-null  int64         
 19  High Wind NW      14593 non-null  int64         
 20  DateTime          14593 non-null  datetime64[ns]
 21  year              14593 non-null  int32         
 22  Month             14593 non-null  int32         
 23  Day               14593 non-null  int32         
dtypes: datetime64[ns](1), float64(2), int32(3), int64(14), object(4)
memory usage: 2.6+ MB

Applying the Saffir-Simpson Hurricane Wind Scale:

In [203]:
def categorize_hurricane(wind_speed):
    if wind_speed >= 64 and wind_speed <= 82:
        return 'Category 1'
    elif wind_speed >= 83 and wind_speed <= 95:
        return 'Category 2'
    elif wind_speed >= 96 and wind_speed <= 112:
        return 'Category 3'
    elif wind_speed >= 113 and wind_speed <= 136:
        return 'Category 4'
    elif wind_speed > 137:
        return 'Category 5'
    else:
        return 'Not a Hurricane'  # For wind speeds below 74 mph

modern_hurricanes_tracks = modern_hurricanes_tracks.copy()

modern_hurricanes_tracks['HurricaneCategory'] = modern_hurricanes_tracks['Maximum Wind'].apply(categorize_hurricane)

# Display the updated DataFrame with the new 'HurricaneCategory' column
print(modern_hurricanes_tracks[['Name', 'Maximum Wind', 'HurricaneCategory']])
                      Name  Maximum Wind HurricaneCategory
33704                ALLEN            30   Not a Hurricane
33705                ALLEN            30   Not a Hurricane
33706                ALLEN            30   Not a Hurricane
33707                ALLEN            30   Not a Hurricane
33708                ALLEN            35   Not a Hurricane
...                    ...           ...               ...
49100                 KATE            55   Not a Hurricane
49101                 KATE            55   Not a Hurricane
49102                 KATE            50   Not a Hurricane
49103                 KATE            45   Not a Hurricane
49104                 KATE            45   Not a Hurricane

[14593 rows x 3 columns]
In [204]:
print(modern_hurricanes_tracks['HurricaneCategory'].isna().sum())
print(modern_hurricanes_tracks['HurricaneCategory'].unique())
0
['Not a Hurricane' 'Category 1' 'Category 2' 'Category 3' 'Category 4'
 'Category 5']
In [205]:
modern_hurricanes_tracks.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14593 entries, 33704 to 49104
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   ID                 14593 non-null  object        
 1   Name               14593 non-null  object        
 2   Event              14593 non-null  object        
 3   Status             14593 non-null  object        
 4   Latitude           14593 non-null  float64       
 5   Longitude          14593 non-null  float64       
 6   Maximum Wind       14593 non-null  int64         
 7   Minimum Pressure   14593 non-null  int64         
 8   Low Wind NE        14593 non-null  int64         
 9   Low Wind SE        14593 non-null  int64         
 10  Low Wind SW        14593 non-null  int64         
 11  Low Wind NW        14593 non-null  int64         
 12  Moderate Wind NE   14593 non-null  int64         
 13  Moderate Wind SE   14593 non-null  int64         
 14  Moderate Wind SW   14593 non-null  int64         
 15  Moderate Wind NW   14593 non-null  int64         
 16  High Wind NE       14593 non-null  int64         
 17  High Wind SE       14593 non-null  int64         
 18  High Wind SW       14593 non-null  int64         
 19  High Wind NW       14593 non-null  int64         
 20  DateTime           14593 non-null  datetime64[ns]
 21  year               14593 non-null  int32         
 22  Month              14593 non-null  int32         
 23  Day                14593 non-null  int32         
 24  HurricaneCategory  14593 non-null  object        
dtypes: datetime64[ns](1), float64(2), int32(3), int64(14), object(5)
memory usage: 2.7+ MB
In [206]:
# Define the mapping for hurricane categories
category_mapping = {
    'Not a Hurricane': 0,
    'Category 1': 1,
    'Category 2': 2,
    'Category 3': 3,
    'Category 4': 4,
    'Category 5': 5
}

# Create a new column for the ordinal coding
modern_hurricanes_tracks['HurricaneCategoryOrdinal'] = modern_hurricanes_tracks['HurricaneCategory'].map(category_mapping)

# Display the updated DataFrame with the new ordinal column
print(modern_hurricanes_tracks[['HurricaneCategory', 'HurricaneCategoryOrdinal']].head(30))
      HurricaneCategory  HurricaneCategoryOrdinal
33704   Not a Hurricane                         0
33705   Not a Hurricane                         0
33706   Not a Hurricane                         0
33707   Not a Hurricane                         0
33708   Not a Hurricane                         0
33709   Not a Hurricane                         0
33710   Not a Hurricane                         0
33711   Not a Hurricane                         0
33712        Category 1                         1
33713        Category 1                         1
33714        Category 1                         1
33715        Category 2                         2
33716        Category 3                         3
33717        Category 4                         4
33718        Category 4                         4
33719        Category 4                         4
33720        Category 5                         5
33721        Category 5                         5
33722        Category 5                         5
33723        Category 5                         5
33724        Category 5                         5
33725        Category 4                         4
33726        Category 4                         4
33727        Category 4                         4
33728        Category 4                         4
33729        Category 5                         5
33730        Category 5                         5
33731        Category 5                         5
33732        Category 5                         5
33733        Category 4                         4
In [207]:
modern_hurricanes_tracks.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14593 entries, 33704 to 49104
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   ID                        14593 non-null  object        
 1   Name                      14593 non-null  object        
 2   Event                     14593 non-null  object        
 3   Status                    14593 non-null  object        
 4   Latitude                  14593 non-null  float64       
 5   Longitude                 14593 non-null  float64       
 6   Maximum Wind              14593 non-null  int64         
 7   Minimum Pressure          14593 non-null  int64         
 8   Low Wind NE               14593 non-null  int64         
 9   Low Wind SE               14593 non-null  int64         
 10  Low Wind SW               14593 non-null  int64         
 11  Low Wind NW               14593 non-null  int64         
 12  Moderate Wind NE          14593 non-null  int64         
 13  Moderate Wind SE          14593 non-null  int64         
 14  Moderate Wind SW          14593 non-null  int64         
 15  Moderate Wind NW          14593 non-null  int64         
 16  High Wind NE              14593 non-null  int64         
 17  High Wind SE              14593 non-null  int64         
 18  High Wind SW              14593 non-null  int64         
 19  High Wind NW              14593 non-null  int64         
 20  DateTime                  14593 non-null  datetime64[ns]
 21  year                      14593 non-null  int32         
 22  Month                     14593 non-null  int32         
 23  Day                       14593 non-null  int32         
 24  HurricaneCategory         14593 non-null  object        
 25  HurricaneCategoryOrdinal  14593 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int32(3), int64(15), object(5)
memory usage: 2.8+ MB

Applying the Mann-Whitney Test for the Different Periods¶

1. Data Filtering:

The DataFrame modern_hurricanes_track is filtered into two periods:

Period 1: 1980 to 1997

Period 2: 1998 to 2015

2. Counting Storms:

A function count_storms is defined to count the number of storms per year in each period using groupby.

3. Mann-Whitney U Test Function:

A function mann_whitney_test is defined to perform the Mann-Whitney U Test and calculate the required statistics (statistic, p-value, mean difference, median difference).

4. Performing the Test:

The test is conducted between the two periods, and results are displayed.

In [209]:
from scipy.stats import mannwhitneyu
import itertools
import pandas as pd

# Assuming modern_hurricanes_track is already defined and has a 'year' column
# Filter the data for the two periods
period_1 = modern_hurricanes_tracks[(modern_hurricanes_tracks['year'] >= 1980) & (modern_hurricanes_tracks['year'] <= 1997)]
period_2 = modern_hurricanes_tracks[(modern_hurricanes_tracks['year'] >= 1998) & (modern_hurricanes_tracks['year'] <= 2015)]

# Create a function to count storms in each period
def count_storms(df):
    return df.groupby('year').size()

# Count storms for each period
storm_counts_period_1 = count_storms(period_1)
storm_counts_period_2 = count_storms(period_2)

# Function to perform Mann-Whitney U Test
def mann_whitney_test(data1, data2):
    if len(data1) > 0 and len(data2) > 0:
        stat, p_value = mannwhitneyu(data1, data2, alternative='two-sided')
        mean_diff = data2.mean() - data1.mean()
        median_diff = data2.median() - data1.median()
        greater_mean = 'Period 2' if mean_diff > 0 else 'Period 1'
        greater_median = 'Period 2' if median_diff > 0 else 'Period 1'
        return stat, p_value, mean_diff, median_diff, greater_mean, greater_median
    else:
        return None, None, None, None, None, None

# Perform the Mann-Whitney U Test between the two periods
result = mann_whitney_test(storm_counts_period_1, storm_counts_period_2)

# Check if the test was valid (not None) and display results
if result[0] is not None:
    stat, p_value, mean_diff, median_diff, greater_mean, greater_median = result
    print("Mann-Whitney U Test Results:")
    print(f"Statistic: {stat}")
    print(f"P-Value: {p_value}")
    print(f"Mean Difference: {mean_diff}")
    print(f"Median Difference: {median_diff}")
    print(f"Greater Mean: {greater_mean}")
    print(f"Greater Median: {greater_median}")
else:
    print("One of the groups is empty, unable to perform the Mann-Whitney U Test.")
Mann-Whitney U Test Results:
Statistic: 52.0
P-Value: 0.0005313629584409026
Mean Difference: 183.38888888888886
Median Difference: 176.5
Greater Mean: Period 2
Greater Median: Period 2

Clustering based on the the Saffir-Simpson Hurricane Wind Scale (in knots)¶

In [211]:
hurricane_names = ['GLORIA', 'ANDREW', 'FELIX', 'LUIS', 'OPAL',
                   'BERTHA', 'DANNY', 'FLOYD', 'GORON',
                   'ISODORE', 'ISABEL', 'ALEX', 'CHARLEY', 
                   'GASTON', 'FRANCES', 'CINDY', 'KATRINA',
                   'ERNESTO', 'HANNA', 'BILL',
                   'IRENE', 'SANDY', 'ARTHUR', 'MATTHEW', 'GERT',
                   'DORIAN', 'LAURA', 'DELTA', 'ELISA', 'HENRI', 'IDA',
                   'FRANKLIN', 'LEE', 'ERNESTO', 'IRMA', 'GEORGES', 'MARILYN', 'HUGO', 
                   'BERYL', 'TAMMY', 'PHILIPPE', 'FRANKLIN', 'BRET', 'FIONA', 'EARL',
                   'SAM', 'GRACE', 'ELSA', 'JOSEPHINE', 'ISAIAS']

# Filter the DataFrame for these names
Hurricana_influence_popular = modern_hurricanes_tracks[modern_hurricanes_tracks['Name'].str.strip().isin(hurricane_names)]
In [212]:
import pandas as pd
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes
from sklearn.metrics import silhouette_score
import folium
import seaborn as sns

# Select the relevant columns for clustering

X_start_coor = Hurricana_influence_popular[['Latitude', 'Longitude', 'HurricaneCategoryOrdinal']].copy()

# Step 2: Run K-Prototypes
cost = []
k_values = range(1, 11)

for k in k_values:
    kproto = KPrototypes(n_clusters=k, init='Huang', random_state=42)
    clusters = kproto.fit_predict(X_start_coor, categorical=[2])  # Categorical attribute at index 2
    cost.append(kproto.cost_)

# Step 3: Plot the Elbow Curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, cost)
plt.xlabel('Number of Clusters')
plt.ylabel('Cost')
plt.title('Elbow Method for K-Prototypes')
plt.xticks(k_values)
plt.grid(True)
plt.show()

# Calculate Silhouette Scores
silhouette_scores = []

for k in k_values[1:]:
    kproto = KPrototypes(n_clusters=k, init='Huang', random_state=42)
    clusters = kproto.fit_predict(X_start_coor, categorical=[2])
    score = silhouette_score(X_start_coor, clusters)
    silhouette_scores.append(score)

# Step 4: Plot Silhouette Scores
plt.figure(figsize=(10, 6))
plt.plot(k_values[1:], silhouette_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Method for K-Prototypes')
plt.xticks(k_values[1:])
plt.grid(True)
plt.show()

# Clustering and Visualization
for optimal_k in [3, 7]:
    kproto = KPrototypes(n_clusters=optimal_k, init='Huang', random_state=42)
    clusters = kproto.fit_predict(X_start_coor, categorical=[2])
    X_start_coor['Cluster'] = clusters.astype(int)

    # Plotting with Matplotlib
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(X_start_coor['Longitude'], X_start_coor['Latitude'], 
                          c=X_start_coor['Cluster'], cmap='viridis', alpha=0.6, edgecolor='k')
    plt.title(f'Cluster Visualization of Storm Events with k={optimal_k}')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.colorbar(scatter, label='Cluster')
    plt.grid(True)
    plt.show()

    # Optional: Interactive Map Visualization using Folium
    map_clusters = folium.Map(location=[X_start_coor['Latitude'].mean(), 
                                        X_start_coor['Longitude'].mean()], zoom_start=5)
    colors = sns.color_palette("viridis", optimal_k).as_hex()

    for _, row in X_start_coor.iterrows():
        cluster_index = int(row['Cluster'])  
        folium.CircleMarker(
            [row['Latitude'], row['Longitude']],
            radius=5,
            color=colors[cluster_index],
            fill=True,
            fill_color=colors[cluster_index],
            fill_opacity=0.7,
            popup=f"Storm Type: {row['HurricaneCategoryOrdinal']}, Cluster: {row['Cluster']}"
        ).add_to(map_clusters)

    # Display the interactive map (if running in a Jupyter Notebook environment)
    display(map_clusters)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Make this Notebook Trusted to load map: File -> Trust Notebook
No description has been provided for this image
Make this Notebook Trusted to load map: File -> Trust Notebook

Time Series for Hurricanes that Influenced New York and Montserrat from 1980 to 2015¶

NOTE: weak category 1 hurricanes can fluctuate between category 1, tropical storms and tropical depressions with pressure ranging from 1005 mb to 1016 mb. Yet, it's still possible breach such range. Of consequence, the pressure parameter to be set to 1024 to avoid triggering invalid (imaginary) outputs.

In [215]:
# Summary statistics for the 'pressure' column
pressure_summary = Hurricana_influence_popular['Minimum Pressure'].describe()
print(pressure_summary)
count    4220.000000
mean      992.809479
std        19.723971
min       902.000000
25%       985.000000
50%      1000.000000
75%      1007.000000
max      1024.000000
Name: Minimum Pressure, dtype: float64
In [216]:
# Iterate over each unique hurricane and plot individually
for name in Hurricana_influence_popular['Name'].unique():
    # Filter data for the specific hurricane
    hurricane_data = Hurricana_influence_popular[Hurricana_influence_popular['Name'] == name]

    # Plotting
    plt.figure(figsize=(10, 6))
    sns.lineplot(
        x=hurricane_data['Day'],
        y=hurricane_data['Maximum Wind'],
        marker='o',
        label=f'{name} ({hurricane_data["year"].iloc[0]}-{hurricane_data["Month"].iloc[0]})'
    )

    # Add pressure as circle markers with varying size and color based on magnitude
    plt.scatter(
        x=hurricane_data['Day'],
        y=hurricane_data['Maximum Wind'],
        c=hurricane_data['Minimum Pressure'],
        s=(1024 - hurricane_data['Minimum Pressure']) * 2,  # size based on pressure
        cmap='coolwarm',
        alpha=0.7,
        edgecolor='k'
    )

    plt.title(f'Hurricane {name}: Wind Speeds with Pressure Indicators')
    plt.xlabel('Day of Month')
    plt.ylabel('Wind Speed (knots)')
    plt.colorbar(label='Pressure (millibars)')
    plt.legend()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Such prior plots are like "fingerprints" for hurricanes, which can only be acquired from pre-existing data.

Influential Meteorological Phenomena¶

Scales of meteorological phenomena are based on their size (horizontal extent) and duration:

Microscale < 2 km - Seconds to minutes - Tornadoes, gusts, turbulence

Mesoscale 2 – 200 km - Minutes to hours - Thunderstorms, squall lines, sea breezes

Synoptic 200 – 2000+ km - Days to a week - Hurricanes, mid-latitude cyclones, cold fronts

Planetary > 2000 km - Weeks to months - Jet streams, Rossby waves

Mesoscale Meteorological Phenomena:

  1. Thunderstorms (singular, multicell and supercell storms)

  2. Tornados (despite being a small-scale phenomena, they arise in mesoscale environments such as supercell thunderstorms)

  3. Fronts (can influence local weather patterns, especially where larger systems interact)

  4. Sea Breezes and Land Breezes (local wind systems influenced by differential heating of land and water are typical mesoscale phenomena)

  5. Orographic Lifting (being the impact of terrains on wind patterns and preciptation that can lead to mesoscale events)

  6. Squall Lines (long lines of thunderstorms associated with cold fronts)

  7. Drylines (boundaries between different air masses, particularly warms, mosit air and hot dry air, often leading to storm formation)

Importance of Mesoscale Analysis

comprehending mesoscale process is vital for weather forecasting, especially for forecasting severe weather events like thundertorms, hail. tornadoes, and flash flooding.

Classifying Extreme Weather Events¶

For extreme weather events one needs to identify the conditions for such. For a purely tropical ambiance in particular, the attributes of interest are temperature, air pressure (concerning tropical depressions, tropical storms or hurricanes), rainfall level, windspeed nad wind gusts.

To now identify the key parameters for each attribute:

  1. In celsius measure 35°C is generally considered quite hot.
  2. In celsius -15°C degrees is considered quite cold.
  3. The highest air pressure, translating to the weakness recognised tropical cyclones, disturbances or waves, is 1016 hPa; such is generally the upper limit for recognised weakest systems.
  4. For rain fall torrential downpours of at least 30 mm within a hour is considered an extreme event;
  5. For wind speed, at least 70 km/h is considered hazardous; same for wind gusts.

To identify an extreme event in a data point or event (out of 1 through 5 prior), one checks if the current value is extreme compared to recent history (usually a rolling window). Hence, one is tagging the observation with a numeric code representing the type of outlier.

FURTHER CLARIFICATIONS --

  1. If current temperature at 2 meters is ≥ 35°C (heatwave threshold), and the recent rolling average is < 35°C, then it’s considered a sudden spike, not a gradual warming trend.
  2. If current temp is ≤ -15°C (severe cold), but the average isn’t that low — Extreme cold.
  3. Mean sea level pressure has dropped below 1016 hPa, possibly indicating a low-pressure system (e.g., storm), but if the average pressure was higher, this drop is sudden — Possible threatening weather system.
  4. Very high rainfall event (≥ 30 mm) within an hour. Not part of a rainy trend → it's an anomalous downpour — Extreme rainfall.
  5. For wind speed or wind gust, if either of them hit 70 km/h or higher, and this isn’t typical in the rolling window — Extreme wind or gusts.
In [221]:
import openmeteo_requests

import pandas as pd
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
	"latitude": 16.7425,
	"longitude": -62.1874,
	"start_date": "2022-01-08",
	"end_date": "2025-06-24",
	"hourly": ["temperature_2m", "rain", "wind_speed_10m", "wind_speed_100m", "wind_gusts_10m", "pressure_msl"],
	"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_rain = hourly.Variables(1).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(2).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(3).ValuesAsNumpy()
hourly_wind_gusts_10m = hourly.Variables(4).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(5).ValuesAsNumpy()

hourly_data = {"date": pd.date_range(
	start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
	end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = hourly.Interval()),
	inclusive = "left"
)}

hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["rain"] = hourly_rain
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["wind_gusts_10m"] = hourly_wind_gusts_10m
hourly_data["pressure_msl"] = hourly_pressure_msl

hourly_dataframe_extreme = pd.DataFrame(data = hourly_data)
print(hourly_dataframe_extreme)
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
                           date  temperature_2m  rain  wind_speed_10m  \
0     2022-01-08 04:00:00+00:00       23.249001   0.0       28.146843   
1     2022-01-08 05:00:00+00:00       22.598999   0.0       27.255590   
2     2022-01-08 06:00:00+00:00       22.348999   0.0       30.498180   
3     2022-01-08 07:00:00+00:00       21.848999   0.1       28.241076   
4     2022-01-08 08:00:00+00:00       22.098999   0.1       29.215502   
...                         ...             ...   ...             ...   
30331 2025-06-24 23:00:00+00:00             NaN   NaN             NaN   
30332 2025-06-25 00:00:00+00:00             NaN   NaN             NaN   
30333 2025-06-25 01:00:00+00:00             NaN   NaN             NaN   
30334 2025-06-25 02:00:00+00:00             NaN   NaN             NaN   
30335 2025-06-25 03:00:00+00:00             NaN   NaN             NaN   

       wind_speed_100m  wind_gusts_10m  pressure_msl  
0            34.634918       56.160000   1018.500000  
1            33.466450       57.599998   1018.299988  
2            36.707645       60.120003   1017.599976  
3            34.743263       62.279995   1017.500000  
4            35.565376       58.679996   1017.400024  
...                ...             ...           ...  
30331              NaN             NaN           NaN  
30332              NaN             NaN           NaN  
30333              NaN             NaN           NaN  
30334              NaN             NaN           NaN  
30335              NaN             NaN           NaN  

[30336 rows x 7 columns]
In [222]:
hourly_dataframe_extreme_clean = hourly_dataframe_extreme.dropna()
hourly_dataframe_extreme_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30309 entries, 0 to 30308
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   date             30309 non-null  datetime64[ns, UTC]
 1   temperature_2m   30309 non-null  float32            
 2   rain             30309 non-null  float32            
 3   wind_speed_10m   30309 non-null  float32            
 4   wind_speed_100m  30309 non-null  float32            
 5   wind_gusts_10m   30309 non-null  float32            
 6   pressure_msl     30309 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(6)
memory usage: 1.2 MB
In [223]:
hourly_dataframe_extreme_clean.isna().sum()
Out[223]:
date               0
temperature_2m     0
rain               0
wind_speed_10m     0
wind_speed_100m    0
wind_gusts_10m     0
pressure_msl       0
dtype: int64
In [224]:
# Define the rolling window size
window_size = 30

# Initialize lists to hold outlier types and rule descriptions
outlier_types = []
outlier_rules = []

# Define the rules for each outlier code
outlier_descriptions = {
    0: "Not an extreme event or insufficient data",
    1: "Extreme heat: temperature_2m ≥ 35°C and 30-day mean < 35°C",
    2: "Extreme cold: temperature_2m ≤ -15°C and 30-day mean > -15°C",
    3: "Possible threatening weather system: pressure_msl ≤ 1016 hPa and 30-day mean > 1016 hPa",
    4: "Extreme rainfall: rain ≥ 30 mm and 30-day mean < 30 mm",
    5: "Extreme wind speed (10m): wind_speed_10m ≥ 70 km/h and 30-day mean < 70 km/h",
    6: "Extreme wind speed (100m): wind_speed_100m ≥ 70 km/h and 30-day mean < 70 km/h",
    7: "Extreme wind gusts: wind_gusts_10m ≥ 70 km/h and 30-day mean < 70 km/h"
}

# Iterate through each row to label outliers based on rule
for index, row in hourly_dataframe_extreme_clean.iterrows():
    if index >= window_size - 1:
        temp_window = hourly_dataframe_extreme_clean['temperature_2m'][index - window_size + 1:index + 1]
        pressure_window = hourly_dataframe_extreme_clean['pressure_msl'][index - window_size + 1:index + 1]
        rainfall_window = hourly_dataframe_extreme_clean['rain'][index - window_size + 1:index + 1]
        wind_speed_10m_window = hourly_dataframe_extreme_clean['wind_speed_10m'][index - window_size + 1:index + 1]
        wind_speed_100m_window = hourly_dataframe_extreme_clean['wind_speed_100m'][index - window_size + 1:index + 1]
        wind_gusts_10m_window = hourly_dataframe_extreme_clean['wind_gusts_10m'][index - window_size + 1:index + 1]

        if row['temperature_2m'] >= 35 and (temp_window.mean() < 35):
            code = 1
        elif row['temperature_2m'] <= -15 and (temp_window.mean() > -15):
            code = 2
        elif row['pressure_msl'] <= 1016 and (pressure_window.mean() > 1016):
            code = 3
        elif row['rain'] >= 30 and (rainfall_window.mean() < 30):
            code = 4
        elif row['wind_speed_10m'] >= 70 and (wind_speed_10m_window.mean() < 70):
            code = 5
        elif row['wind_speed_100m'] >= 70 and (wind_speed_100m_window.mean() < 70):
            code = 6
        elif row['wind_gusts_10m'] >= 70 and (wind_gusts_10m_window.mean() < 70):
            code = 7
        else:
            code = 0
    else:
        code = 0  # Not enough data for comparison

    outlier_types.append(code)
    outlier_rules.append(outlier_descriptions[code])

# Assign new columns safely with .loc
hourly_dataframe_extreme_clean = hourly_dataframe_extreme_clean.copy()
hourly_dataframe_extreme_clean.loc[:, 'outlier_type'] = outlier_types
hourly_dataframe_extreme_clean.loc[:, 'outlier_rule'] = outlier_rules

# Print unique rules (excluding "not extreme")
unique_rules = set(outlier_rules)
print("Unique extreme event rules detected:")
for rule in unique_rules:
    if rule != outlier_descriptions[0]:
        print(f"- {rule}")

# Filter extreme events only
extreme_events = hourly_dataframe_extreme_clean[hourly_dataframe_extreme_clean['outlier_type'] != 0].copy()

# Convert timezone-aware datetime to naive and then to float seconds since epoch
time_column = hourly_dataframe_extreme_clean['date'].dt.tz_localize(None).astype('int64') / 1e9

# Adjust time_column safely
time_column.loc[time_column <= 0] += 1e-6

# Assign event observed column safely
hourly_dataframe_extreme_clean.loc[:, 'event_observed'] = hourly_dataframe_extreme_clean['outlier_type'].apply(lambda x: 1 if x != 0 else 0)

# Print the updated DataFrame
print(hourly_dataframe_extreme_clean)
Unique extreme event rules detected:
- Possible threatening weather system: pressure_msl ≤ 1016 hPa and 30-day mean > 1016 hPa
- Extreme wind speed (100m): wind_speed_100m ≥ 70 km/h and 30-day mean < 70 km/h
- Extreme wind gusts: wind_gusts_10m ≥ 70 km/h and 30-day mean < 70 km/h
                           date  temperature_2m  rain  wind_speed_10m  \
0     2022-01-08 04:00:00+00:00       23.249001   0.0       28.146843   
1     2022-01-08 05:00:00+00:00       22.598999   0.0       27.255590   
2     2022-01-08 06:00:00+00:00       22.348999   0.0       30.498180   
3     2022-01-08 07:00:00+00:00       21.848999   0.1       28.241076   
4     2022-01-08 08:00:00+00:00       22.098999   0.1       29.215502   
...                         ...             ...   ...             ...   
30304 2025-06-23 20:00:00+00:00       25.449001   0.0       41.403522   
30305 2025-06-23 21:00:00+00:00       25.648998   0.0       40.892101   
30306 2025-06-23 22:00:00+00:00       25.799000   0.0       42.026817   
30307 2025-06-23 23:00:00+00:00       25.098999   0.0       42.705925   
30308 2025-06-24 00:00:00+00:00       25.549000   0.1       41.760387   

       wind_speed_100m  wind_gusts_10m  pressure_msl  outlier_type  \
0            34.634918       56.160000   1018.500000             0   
1            33.466450       57.599998   1018.299988             0   
2            36.707645       60.120003   1017.599976             0   
3            34.743263       62.279995   1017.500000             0   
4            35.565376       58.679996   1017.400024             0   
...                ...             ...           ...           ...   
30304        46.980347       52.560001   1015.099976             0   
30305        46.474869       52.560001   1015.000000             0   
30306        47.786861       52.560001   1015.500000             0   
30307        48.116932       56.160000   1016.400024             0   
30308        47.160343       54.360001   1017.099976             0   

                                    outlier_rule  event_observed  
0      Not an extreme event or insufficient data               0  
1      Not an extreme event or insufficient data               0  
2      Not an extreme event or insufficient data               0  
3      Not an extreme event or insufficient data               0  
4      Not an extreme event or insufficient data               0  
...                                          ...             ...  
30304  Not an extreme event or insufficient data               0  
30305  Not an extreme event or insufficient data               0  
30306  Not an extreme event or insufficient data               0  
30307  Not an extreme event or insufficient data               0  
30308  Not an extreme event or insufficient data               0  

[30309 rows x 10 columns]

Interpretation of the Extreme Events Detected¶

Caution: the time span of the data can be considered small. However, concerning research, hourly data over various years can be quite computationally expensive. Specifically for the Montserrat territory, by consensus its climate is tropical. Hence, cold temperatures observed in temperate and artic climates are highly inplausible. Extreme high temperature trends not being present can be attributed to Montserrat being a very small land mass island in the Caribbean Sea; highly influenced by coastal or oceanic-atmospheric dynamic.

Unique extreme event rules detected:

  • Possible threatening weather system: pressure_msl ≤ 1016 hPa and 30-day mean > 1016 hPa

  • Extreme wind speed (100m): wind_speed_100m ≥ 70 km/h and 30-day mean < 70 km/h

  • Extreme wind gusts: wind_gusts_10m ≥ 70 km/h and 30-day mean < 70 km/h

Concerning pressure_msl, pressures at 1016 hPa or below are related to Hurricanes, tropical depressions, tropical storms, tropical waves, etc., etc., etc. Extreme wind speeds and extreme wind gusts are heavliy tied to low pressure systems; yet prior observed (Pearson) correlation measures for wind_speed_10m and wind_speed_100m with pressure_msl yielding (036. 0.39, etc.) are a bit disappointing when negative measure in correlation is expected.

Nevertheless, models in physics exonerate an "inverse" relationship between (atmospheric) pressure and wind speed/gust.

Physics Models Relating Atmospheric Pressure and Wind Speed¶

To model the relationship between low atmospheric pressure and high wind speeds/gusts, several fundamental physics models and equations from atmospheric dynamics and fluid mechanics are relevant. These capture the behavior of air flow in response to pressure gradients and the Earth's rotation. However, each model is relevant for specific settings.

1. Pressure Gradient Force (PGF)¶

Being the primary force responsible for wind. Air naturally moves from high-pressure areas to low-pressure areas due to the pressure gradient force:

$$ \vec{F}_{\text{PGF}} = -\frac{1}{\rho} \nabla P $$

$ \vec{F}_{\text{PGF}} $ : Pressure Gradient Force per unit mass (vector)

$ \rho $ : Air density ($\frac{kg}{m^3}$)

$ \nabla P $ : Gradient of pressure (change in pressure over distance)

Air accelerates from high to low pressure; the stronger the pressure gradient, the stronger the resulting force and hence wind.

Such model is generally observed, however, such a model is stronlgy recognised for highly controlled environments like hydraulics, water management in civil engineering, etc.

Geostrophic Wind Equation (Large-Scale, Upper Atmosphere)¶

In large-scale atmospheric flows (away from surface friction), wind tends to balance between the Coriolis force and the pressure gradient force.

$$ \vec{v}_g = \frac{1}{f \rho} \hat{k} \times \nabla P $$

$\vec{v}_g$: Geostrophic wind velocity

$f = 2\Omega \sin \phi$: Coriolis parameter (Earth's rotation rate $\Omega$ and latitude $\phi$)

$\hat{k}$: Unit vector in the vertical direction

Cyclostrophic Wind Equation (Small-Scale such as Hurricanes, Tornadoes)¶

Applicable to small-scale, rapidly rotating low-pressure systems (e.g., tornadoes, tropical cyclones) where Coriolis force is negligible.

$$ \frac{v^2}{r} = \frac{1}{\rho} \frac{dP}{dr} $$

$v$: Wind speed

$r$: Radius from the center of rotation

$\frac{dP}{dr}$: Radial pressure gradient

Gradient Wind Equation (Curved Flow Around Lows)¶

A generalization of geostrophic and cyclostrophic wind, includes both Coriolis and centripetal forces.

$$ \frac{v^2}{r} + fv = \frac{1}{\rho} \frac{dP}{dr} $$

Includes both Coriolis force ($fv$) and centripetal force ($v^2/r$)

Observed is a quadratic equation in $v$ to solve.

Bernoulli’s Principle (Idealized, Steady Flow)¶

In special cases (non-rotating, frictionless, incompressible air), energy conservation applies. In ideal, frictionless, incompressible flow:

$$ \frac{P}{\rho} + \frac{v^2}{2} + gh = \text{constant} $$

$P$: Pressure

$v$: Wind speed

$g$: Gravitational acceleration

$h$: Height

Navier-Stokes Equations (Full Atmospheric Motion - Numerical Modelling)¶

The full motion of air parcels includes pressure gradient, Coriolis, and friction forces. To fully simulate wind, especially in numerical weather prediction models:

$$ \frac{D\vec{v}}{Dt} = -\frac{1}{\rho} \nabla P + \vec{F}_c + \vec{F}_{\text{fric}} $$

$\frac{D\vec{v}}{Dt}$: Material (total) derivative of velocity

$\vec{F}_c$: Coriolis force

$\vec{F}_{\text{fric}}$: Frictional force

The seemingly most promising or applicable models concerning direct data integretion, are the Cyclostrophic Wind Equation and the Gradient Wind Equation. Observation of the respective parameters or attributes, such two models are highly relatable to common meteoroligical data and data for aggressive weather activity such as tropical waves, tropical depressions, cyclones and tornado events. The identified atmospheric pressure - wind speed relationship can be observed.

Multinomial Logistic Model for Storm Events¶

Multinomial logistic regression is an extension of binomial logistic regression applied for predicting categorical outcomes exceeding more than two classes.

Model Description¶

For $K$ number of classes in the categorical response variable $Y$, which can take on values $y \in\,\,{1,2,...,K}$;

$X$ being a vector of predictors, say, $X = [X_1,X_2,...,X_n]$, where $n$ is the number of predictors.

Probability Model¶

The multinomial logistic regression model estimates the probability of each class $k$ given the predictors $X$:

$$P(Y = k|X) = \frac{e^{\beta_{0k} + \beta_{1k}}X_1 + \beta_{2k}X_2 + \ldots + \beta_{pk}X_p}{\sum_{j=1}^{k} e^{\beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p}}$$

where:

$P(Y = k|X)$ is the probability that the dependent variable $Y$ is equal to class $k$ given the predictor variables $X$;

$\beta_{0k}$ being the intercept for class $k$;

$\beta_{ik}$ being the coefficient for predctor $X_i$ for class $k$.

Reference Class¶

Customarily, one class is chosen as the reference class (typically class 1), and the probabilities for other classes are modeled relative to this reference class. Namely:

$$P(Y = k|X) = \frac{e^{\beta_{0k} + \beta_{1k}}X_1 + \beta_{2k}X_2 + \ldots + \beta_{pk}X_p}{1+ \sum_{j=2}^{k} e^{\beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p}}\,\,\,\text{for}\,\,k = 2,..,K$$

where the probability for the reference class (class 1) is:

$$P(Y = k|X) = \frac{1}{\sum_{j=1}^{k} e^{\beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p}}$$

Log Odds Ratios¶

The Log Odds Ratios (logit) for class $k$ relative to the reference class can be expressed as:

$$\text{log}(\frac{Y = k|X}{Y = 1|X}) = e^{\beta_{0k} + \beta_{1k}}X_1 + \beta_{2k}X_2 + \ldots + \beta_{pk}X_p$$

Such above exhibits the the log-odds of being in class $k$ relative to reference class can be modeled as a linear combination of the predictors.

Estimation¶

The coefficients $\beta_{ik}$ are estimated using the maximum likelihood estimation (MLE), finding the set of paraneters that maximizes the likelihood of the observed data given the model.

Abstract¶

The multinomial logistic regression model predicts the probabilities of different classes (categories) based on features via softmax (or sigmoid) function that transforms the linear combination of features into probabilities, ensuring that they sum to 1 across all classes.

In [226]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

multi_logit_data = hourly_dataframe_extreme_clean.copy()
# Prepare features and target
X = multi_logit_data[['temperature_2m', 'rain', 'wind_speed_10m',
                      'wind_speed_100m', 'wind_gusts_10m', 'pressure_msl']]
y = multi_logit_data['outlier_type']  # Fix 1: use Series

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit logistic regression
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=2000, class_weight='balanced')  # Fix 2: more iterations
model.fit(X_train, y_train)

# Predictions and report
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=0)
print(report)
              precision    recall  f1-score   support

           0       0.96      0.61      0.74      5511
           3       0.12      0.64      0.20       437
           6       0.50      1.00      0.67         1
           7       0.38      1.00      0.55       113

    accuracy                           0.62      6062
   macro avg       0.49      0.81      0.54      6062
weighted avg       0.89      0.62      0.70      6062

Class 0: High precision (0.96) → Most predicted class 0s were correct.

Low recall (0.61) → Many actual class 0s were misclassified.

This suggests the model is underpredicting class 0 or confusing it with minority classes.

Class 3: Very low precision (0.12) → Most predicted class 3s were actually another class.

High recall (0.64) → Many actual class 3s were found, but at the cost of high false positives.

Suggests class confusion, possibly because of feature overlap.

Class 6: Perfect recall/precision (1.0/0.50) but only 1 sample – not statistically meaningful.

Class 7: Moderate precision (0.38) and perfect recall (1.00) – model finds all class 7 cases but includes many false positives.

Accuracy: 62% — misleading due to class imbalance.

Macro average F1: 0.54 — shows performance is poor on minority classes.

Weighted avg F1: 0.70 — dominated by the majority class (0).

Logistic regression might be too rigid for capturing complex relationships.

Alternatives:

  1. RandomForestClassifier (robust, handles imbalance better)

  2. XGBoost or LightGBM

  3. GradientBoostingClassifier

Survival Analysis with (Hourly) Weather Data For Extreme Events¶

Survival analysis, a statistical methodology traditionally employed in fields like medicine and engineering, has found increasing application in the realm of meteorology. By treating weather events as "survival" times, researchers can gain valuable insights into their duration, frequency, and underlying factors.

One of the key challenges in applying survival analysis to weather data is the presence of censored events. Weather events often do not have a definitive endpoint, especially when the data collection period ends before the event concludes. This necessitates the use of survival analysis techniques that can handle censored observations, such as the Kaplan-Meier estimator.

Furthermore, weather patterns are influenced by a multitude of factors, including climate change, El Niño-Southern Oscillation, and local geographic conditions. These factors can be incorporated into survival models as time-varying covariates, providing a more nuanced understanding of the factors driving the duration of weather events.

Spatial dependencies also play a significant role in weather phenomena. Survival models can be extended to account for these dependencies, allowing for a more accurate representation of the spatial distribution of weather events.

Applications of survival analysis in weather data are diverse. For instance, researchers can use it to quantify the duration and frequency of extreme events like hurricanes, floods, and wildfires. This information can be invaluable for disaster management and risk assessment. Additionally, survival analysis can be employed to evaluate the impact of climate change on the occurrence and characteristics of weather events, aiding in climate adaptation planning.

The applied data set concerns hourly meteorological data focused on the New York City area.

In [229]:
survival_data = hourly_dataframe_extreme_clean[['date', 'temperature_2m', 'rain',
                                                'wind_speed_10m','wind_speed_100m',
                                                'wind_gusts_10m',
                                                'pressure_msl', 'outlier_type']].copy()

survival_data['year'] = survival_data['date'].dt.year
survival_data['month'] = survival_data['date'].dt.month
survival_data['day'] = survival_data['date'].dt.day
survival_data['hour'] = survival_data['date'].dt.hour

# SRecreate timestamp from parts (ensures consistent precision)
survival_data['timestamp'] = pd.to_datetime(survival_data[['year', 'month', 'day', 'hour']])

survival_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30309 entries, 0 to 30308
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   date             30309 non-null  datetime64[ns, UTC]
 1   temperature_2m   30309 non-null  float32            
 2   rain             30309 non-null  float32            
 3   wind_speed_10m   30309 non-null  float32            
 4   wind_speed_100m  30309 non-null  float32            
 5   wind_gusts_10m   30309 non-null  float32            
 6   pressure_msl     30309 non-null  float32            
 7   outlier_type     30309 non-null  int64              
 8   year             30309 non-null  int32              
 9   month            30309 non-null  int32              
 10  day              30309 non-null  int32              
 11  hour             30309 non-null  int32              
 12  timestamp        30309 non-null  datetime64[ns]     
dtypes: datetime64[ns, UTC](1), datetime64[ns](1), float32(6), int32(4), int64(1)
memory usage: 2.1 MB

The Kaplan-Meier Estimator: A Tool for Weather Data Analysis¶

The Kaplan-Meier estimator (KME), a cornerstone of survival analysis (Stalpers and Kaplan 2018), has found applications beyond its traditional medical and engineering domains. In the field of meteorology, it can be employed to analyze the duration of weather events, such as heatwaves, cold spells, or droughts.

For event times $t_i$ being the times when an event (flood, rainfall, death, etc.) occurs. The KME estimator focuses only on such distinct event times, ignoring intervals where no events occur.The number of events $d_i$ being the count of events (flood, rainfall, death, etc.) occurring at each event time $t_i$.

The number at risk $n_i$ is the number of individuals (or areas) not yet experiencing the event or censored right before time $t_i$. It represents the group of individuals or areas who are at risk of experiencing the event at that specific time.

The KNME survival function is calculated as the product of survival probabilities over time:

$$\hat{S}(t) = \left(1 - \frac{n_1}{d_1}\right) \times \left(1 - \frac{n_2}{d_2}\right) \times \ldots \times \left(1 - \frac{n_k}{d_k}\right)$$

The above estimator assumes that:

  1. Censoring is independent of survival times.
  2. Survival probabilities are constant between event times.
  3. The risk set is updated accurately to reflect censored observations.The Kaplan-Meier estimator provides a step function that estimates the probability of survival beyond a certain time. It accounts for censored data (subjects who are lost to follow-up or whose event time extends beyond the study period) by adjusting the risk set accordingly at each step.

$S(t)$ being the probability that life is longer than $t$, the general function:

$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

The survival curve will step down at each $t_i$ where the event ends; exhibiting the probability of an event continuing past a certain point. So, for $S(3)= 85$ , such conveys that 85% of particular event type lasts more than 3 days (months, etc.). The curve drops as more of the event type ends, conveying the likelihood of the event type persisting beyond each time point.

Weather events often exhibit characteristics that align well with the concepts of survival analysis. For instance, the duration of a heatwave can be considered a "survival time," and the event might be censored if it is still ongoing when the data collection period ends.

By applying the Kaplan-Meier estimator to weather data, researchers can:

  1. Estimate the duration of weather events: Quantify the average length of heatwaves, cold spells, or other extreme events.

  2. Compare the duration of events across different regions or time periods: Identify trends and variations in the persistence of weather phenomena.

  3. Assess the impact of climate change: Examine how the duration of weather events has changed over time and whether there are discernible trends related to climate change.

  4. Inform decision-making: Provide valuable insights for policymakers, emergency managers, and public health officials in planning and response to weather-related events.

The Kaplan-Meier estimator's ability to handle censored data is particularly valuable in weather analysis, as many events may not have a definitive endpoint within the study period. Additionally, the estimator can be used to create survival curves, which visually represent the probability of a weather event continuing beyond a certain duration.

The Weibull Parametric Model in Survival Analysis for Weather¶

The Weibull parametric model has emerged as a powerful tool in the field of survival analysis, particularly for analyzing weather-related phenomena. Its versatility in modeling various distributions, including exponential, Rayleigh, and extreme value, makes it a valuable choice for understanding the time to occurrence of weather events such as storms, droughts, or heatwaves.

The Weibull distribution is characterized by two parameters: the shape parameter $(k)$ and the scale parameter $(\lambda)$. The shape parameter determines the overall shape of the distribution, while the scale parameter influences the location of the distribution along the time axis. When $k = 1$, the Weibull distribution reduces to the exponential distribution, which is often used to model the time between events in a Poisson process.

In the context of Weibull parametric model theory (Li, Marcuss and Russell 2024), considering the accelerated failure time (AFT) model:

$$Y=\text {log}(T)=\mu+\alpha\,Z+\sigma\epsilon$$

where $T$ is the survival time, $\mu$ is the intercept, $Z$ is an $n$ by $p$ matrix, having $n$ as the number of samples and $p$ as the number of predictors/covariates; $\alpha$ is the coefficient of the predictors. $\epsilon$ is a random error term assumed to follow the extreme value distribution. for the Weibull distribution there is an additional parameter $\sigma$ which scales $\epsilon$. Let

$$\gamma = \frac{1}{\sigma}$$$$\lambda=e^{-\frac{\mu}{\sigma}}$$$$\beta=-\frac{\alpha}{\sigma}$$

Then to have the Weibull model with a baseline hazard of:

$$h(x|z)=(\gamma\lambda\,t^{\gamma-1})e^{\beta\,Z}$$

where $\gamma$ is the shape parameter, and $\lambda$ is the scale parameter. The hazard ratio (HR) is defined as:

$$HR=e^{\beta}$$

The Weibull model has found numerous applications in weather analysis. One such application is in the study of storm duration. By fitting the Weibull model to historical data on storm durations, researchers can estimate the probability of a storm lasting a certain duration. This information can be invaluable for emergency planning and disaster response.

Another important application of the Weibull model is in the analysis of the time between extreme weather events. This can help researchers understand the frequency and intensity of these events, such as droughts or heatwaves. By identifying patterns in the timing of extreme events, researchers can gain insights into the underlying factors driving their occurrence.

In addition to analyzing storm durations and extreme events, the Weibull model can also be used to study the failure time of weather-related equipment. This information can be used to optimize maintenance schedules and ensure the reliability of weather data. For example, by analyzing the failure times of weather sensors, researchers can determine the optimal frequency of inspections and repairs.

Finally, the Weibull model can be used to analyze extreme values of weather variables, such as temperature or precipitation. This can help identify and quantify extreme events that may pose significant risks to society and infrastructure. By understanding the probability of extreme events, researchers can develop strategies for mitigating their impacts.

The Weibull model offers several advantages that make it a valuable tool for weather analysis. One of its key advantages is its flexibility. The Weibull model can accommodate a wide range of distributions, making it suitable for modeling various weather phenomena.

Another advantage of the Weibull model is its ease of interpretation. The parameters of the Weibull model have clear interpretations, making it easier to understand the results of the analysis. This makes the model accessible to researchers and practitioners with varying levels of statistical expertise.

Furthermore, the Weibull model can be used for statistical inference, such as hypothesis testing and confidence interval estimation. This allows researchers to draw conclusions about the underlying population based on the sample data.

While the Weibull model is a powerful tool, it is important to be aware of its limitations and considerations. One limitation of the Weibull model is that it assumes that the hazard rate function is monotonic. If this assumption is violated, the model may not provide accurate results.

Another factor to consider is the quality of the data used in the analysis. The accuracy of the Weibull model depends on the quality of the data. Incomplete or biased data can lead to misleading results.

Finally, it is important to consider the appropriateness of the Weibull model for the specific weather phenomenon being studied. In some cases, other parametric or nonparametric models may be more suitable. It is essential to carefully consider the characteristics of the data and the research objectives when selecting a model.

Survival Analysis during the Hurricane Season¶

The Atlantic hurricane season spans from June 1st to November 30th, also coincides with the rainy season for Montserrat. Now, to develop and observe survival analysis for such season.

In [232]:
import pandas as pd 
import matplotlib.pyplot as plt 
from lifelines import KaplanMeierFitter, WeibullFitter

# Filter to hurricane season months (June to November) and create a fresh copy
survival_data = survival_data[survival_data['month'].between(6, 11)].copy()

# Define season start (June 1st midnight) for each year
survival_data['season_start'] = pd.to_datetime(survival_data['year'].astype(str) + '-06-01 00:00:00')

# Calculate duration in hours from season start
survival_data['duration'] = (survival_data['timestamp'] - survival_data['season_start']).dt.total_seconds() / 3600

# Ensure strictly positive durations (for Weibull model)
survival_data['duration'] = survival_data['duration'].apply(lambda x: x + 1e-6 if x <= 0 else x)

# Define binary event flag (1 = extreme event, 0 = normal)
survival_data['event'] = (survival_data['outlier_type'] > 0).astype(int)

# Assign time column
time_column = survival_data['duration']

# Clear previous figures
plt.clf()
plt.cla()
plt.close('all')

# === Kaplan-Meier Survival Curve ===
kmf = KaplanMeierFitter()
kmf.fit(durations=time_column, event_observed=survival_data['event'])

plt.figure(figsize=(10, 6))
kmf.plot_survival_function()
plt.title("Kaplan-Meier Survival Curve for Extreme Weather Events")
plt.xlabel("Hours Since June 1st")
plt.ylabel("Survival Probability")
plt.grid(True)
plt.show()

# === Weibull Parametric Survival Curve ===
wf = WeibullFitter()
wf.fit(durations=time_column, event_observed=survival_data['event'])

plt.figure(figsize=(10, 6))
wf.plot_survival_function()
plt.title("Weibull Survival Curve for Extreme Weather Events")
plt.xlabel("Hours Since June 1st")
plt.ylabel("Survival Probability")
plt.grid(True)
plt.show()
No description has been provided for this image
No description has been provided for this image

NOTE: the results above are only representative of the applied data (time range and place).

Interpretation¶

Kaplan-Meier

The Kaplan-Meier survival curve provided offers a visual representation of the probability of surviving (i.e., not experiencing) an extreme weather event over time. This statistical tool is commonly used in survival analysis to assess the likelihood of an event occurring within a specific timeframe.

A key observation from the curve is its general downward slope with downward concavity, indicating a decreasing probability of survival over time. This is expected, as the longer the observation period, the greater the chance of encountering an extreme weather event. Initially, the curve starts at a very high probability (close to 1.000), suggesting a low likelihood of such events at the beginning of the study period. However, as time progresses, the curve gradually slopes downward, indicating an increasing risk of experiencing an extreme weather event.

The shaded area around the curve represents the confidence interval, which indicates the range of possible survival probabilities. A narrower confidence band suggests greater certainty in the estimate, while a wider band indicates more uncertainty. In this case, the relatively narrow confidence bands suggest that the estimates are reasonably reliable.

Based on these observations, we can infer that:

  1. The study period began with a low probability of experiencing an extreme weather event.

  2. Over time, the risk of such events increased.

  3. The uncertainty in the estimates is relatively low.

Weibull

The provided Weibull survival curve offers a visual representation of the probability of surviving (i.e., not experiencing) an extreme weather event over time. This statistical tool is commonly used in survival analysis to model the distribution of failure times, in this case, the occurrence of extreme weather events. The observations:

  1. The curve shows a general downward slope, indicating a decreasing probability of survival over time. This is expected, as the longer the observation period, the greater the chance of encountering an extreme weather event.

  2. The blue line represents the Weibull estimate, which is a parametric model that fits a specific probability distribution to the data. In this case, the Weibull distribution is used to model the time to occurrence of extreme weather events.

  3. The shaded area around the curve represents the confidence interval, which indicates the range of possible survival probabilities. A wider confidence band suggests greater uncertainty in the estimate.

To now apply survival analysis with the Kaplan-Meier model:

Model Output Statistics (MOS) With Random Forest¶

Model Output Statistics (MOS) is a statistical technique used to calibrate the output of a numerical weather prediction (NWP) model. It involves training a statistical model on historical data to relate the raw NWP model output to observed values. This calibration can improve the accuracy and reliability of weather forecasts.

Random forest is a popular machine learning algorithm that can be used for both classification and regression tasks. When applied to weather forecasting, random forest can be used to predict various meteorological variables, such as temperature, precipitation, and wind speed.

To incorporate MOS with a random forest model for weather forecasting, the following steps are generally involved:

  1. Data Preparation:

Collect historical data for both the NWP model output and observed values. Ensure that the data is aligned in terms of time and location. Consider preprocessing the data, such as handling missing values or outliers.

  1. Random Forest Training:

Train a random forest model using the historical data. The input features of the model would be the NWP model output variables, and the target variable would be the corresponding observed values.

  1. MOS Calibration:

Once the random forest model is trained, apply it to the NWP model output to obtain calibrated forecasts. The calibrated forecasts are the output of the random forest model, which have been adjusted based on the historical relationship between the NWP model output and observed values.

Benefits of MOS with Random Forest:

  1. Improved Accuracy: MOS can help to correct systematic biases in the NWP model output, leading to more accurate forecasts.

  2. Enhanced Reliability: MOS can improve the reliability of forecasts, especially for extreme weather events.

  3. Better Calibration: MOS can calibrate the probabilistic output of the NWP model, providing more accurate estimates of uncertainty.

  4. Flexibility: Random forest is a flexible algorithm that can be applied to various weather variables and forecasting tasks.

The MOS Random Forest model aims to predict the corrected weather variable (e.g., precipitation) based on features derived from weather observations (e.g., temperature, humidity, pressure).

NOTE: due to time constraints, lack of resources and the sophistication of NWP models, an actual NWP model with its outputs will not be implemented. Rather, relying solely on a multivariate regression model structured on observed data (training observations) to compare with the unobserved (test set data).

To now acquire data to serve the MOS pursuit....

In [239]:
import openmeteo_requests

import pandas as pd
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
	"latitude": 16.7425,
	"longitude": -62.1874,
	"start_date": "2022-01-08",
	"end_date": "2025-06-24",
	"hourly": ["temperature_2m", "rain", "wind_speed_10m", "wind_speed_100m", "pressure_msl", "relative_humidity_2m", "dew_point_2m", "surface_pressure", "vapour_pressure_deficit", "boundary_layer_height", "cloud_cover_low", "cloud_cover_mid", "cloud_cover_high", "diffuse_radiation_instant"],
	"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_rain = hourly.Variables(1).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(2).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(3).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(4).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(5).ValuesAsNumpy()
hourly_dew_point_2m = hourly.Variables(6).ValuesAsNumpy()
hourly_surface_pressure = hourly.Variables(7).ValuesAsNumpy()
hourly_vapour_pressure_deficit = hourly.Variables(8).ValuesAsNumpy()
hourly_boundary_layer_height = hourly.Variables(9).ValuesAsNumpy()
hourly_cloud_cover_low = hourly.Variables(10).ValuesAsNumpy()
hourly_cloud_cover_mid = hourly.Variables(11).ValuesAsNumpy()
hourly_cloud_cover_high = hourly.Variables(12).ValuesAsNumpy()
hourly_diffuse_radiation_instant = hourly.Variables(13).ValuesAsNumpy()

hourly_data = {"date": pd.date_range(
	start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
	end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = hourly.Interval()),
	inclusive = "left"
)}

hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["rain"] = hourly_rain
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["pressure_msl"] = hourly_pressure_msl
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["dew_point_2m"] = hourly_dew_point_2m
hourly_data["surface_pressure"] = hourly_surface_pressure
hourly_data["vapour_pressure_deficit"] = hourly_vapour_pressure_deficit
hourly_data["boundary_layer_height"] = hourly_boundary_layer_height
hourly_data["cloud_cover_low"] = hourly_cloud_cover_low
hourly_data["cloud_cover_mid"] = hourly_cloud_cover_mid
hourly_data["cloud_cover_high"] = hourly_cloud_cover_high
hourly_data["diffuse_radiation_instant"] = hourly_diffuse_radiation_instant

MOS_hourly_dataframe = pd.DataFrame(data = hourly_data)
print(MOS_hourly_dataframe)
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
                           date  temperature_2m  rain  wind_speed_10m  \
0     2022-01-08 04:00:00+00:00       23.249001   0.0       28.146843   
1     2022-01-08 05:00:00+00:00       22.598999   0.0       27.255590   
2     2022-01-08 06:00:00+00:00       22.348999   0.0       30.498180   
3     2022-01-08 07:00:00+00:00       21.848999   0.1       28.241076   
4     2022-01-08 08:00:00+00:00       22.098999   0.1       29.215502   
...                         ...             ...   ...             ...   
30331 2025-06-24 23:00:00+00:00       25.949001   0.0       43.795891   
30332 2025-06-25 00:00:00+00:00       25.398998   0.0       43.793671   
30333 2025-06-25 01:00:00+00:00             NaN   NaN             NaN   
30334 2025-06-25 02:00:00+00:00             NaN   NaN             NaN   
30335 2025-06-25 03:00:00+00:00             NaN   NaN             NaN   

       wind_speed_100m  pressure_msl  relative_humidity_2m  dew_point_2m  \
0            34.634918   1018.500000             71.679909     17.848999   
1            33.466450   1018.299988             76.695610     18.299000   
2            36.707645   1017.599976             76.176575     17.949001   
3            34.743263   1017.500000             79.526360     18.148998   
4            35.565376   1017.400024             80.060951     18.499001   
...                ...           ...                   ...           ...   
30331        49.774147   1016.799988             75.106316     21.199001   
30332        49.785542   1017.099976             81.488701     21.999001   
30333              NaN           NaN                   NaN           NaN   
30334              NaN           NaN                   NaN           NaN   
30335              NaN           NaN                   NaN           NaN   

       surface_pressure  vapour_pressure_deficit  boundary_layer_height  \
0            982.982544                 0.807554                  805.0   
1            982.713318                 0.638901                  805.0   
2            982.008240                 0.643319                  750.0   
3            981.852722                 0.536290                  795.0   
4            981.785583                 0.530283                  835.0   
...                 ...                      ...                    ...   
30331        981.655457                 0.833751                 1160.0   
30332        981.881592                 0.600079                 1000.0   
30333               NaN                      NaN                    NaN   
30334               NaN                      NaN                    NaN   
30335               NaN                      NaN                    NaN   

       cloud_cover_low  cloud_cover_mid  cloud_cover_high  \
0                 16.0             24.0               0.0   
1                  0.0             35.0               0.0   
2                 52.0             43.0               0.0   
3                  1.0             41.0               0.0   
4                 28.0             13.0               0.0   
...                ...              ...               ...   
30331             58.0              0.0             100.0   
30332             42.0              0.0             100.0   
30333              NaN              NaN               NaN   
30334              NaN              NaN               NaN   
30335              NaN              NaN               NaN   

       diffuse_radiation_instant  
0                            0.0  
1                            0.0  
2                            0.0  
3                            0.0  
4                            0.0  
...                          ...  
30331                        0.0  
30332                        0.0  
30333                        NaN  
30334                        NaN  
30335                        NaN  

[30336 rows x 15 columns]

Some cleaning and probing of the data:

In [241]:
MOS_data = MOS_hourly_dataframe.dropna()
MOS_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25965 entries, 0 to 30332
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   date                       25965 non-null  datetime64[ns, UTC]
 1   temperature_2m             25965 non-null  float32            
 2   rain                       25965 non-null  float32            
 3   wind_speed_10m             25965 non-null  float32            
 4   wind_speed_100m            25965 non-null  float32            
 5   pressure_msl               25965 non-null  float32            
 6   relative_humidity_2m       25965 non-null  float32            
 7   dew_point_2m               25965 non-null  float32            
 8   surface_pressure           25965 non-null  float32            
 9   vapour_pressure_deficit    25965 non-null  float32            
 10  boundary_layer_height      25965 non-null  float32            
 11  cloud_cover_low            25965 non-null  float32            
 12  cloud_cover_mid            25965 non-null  float32            
 13  cloud_cover_high           25965 non-null  float32            
 14  diffuse_radiation_instant  25965 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(14)
memory usage: 1.8 MB

Recalling that for continuous variables Pearson Corrrelation serves well for association among variables and to measure the level f linearity. Again, there is no rule about variable relations needing to be linear.

In [243]:
# Applying pearson correlation to the data set.
import matplotlib.pyplot as plt
import seaborn as sns
pearson_corr_hourly = MOS_data.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(pearson_corr_hourly, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation Heatmap for Hourly Data')
plt.savefig('heatmap.pdf', format='pdf')
plt.show()
No description has been provided for this image

Based on observations from the prior correlation heatmap, one can conclude that a basic OLS linear prediction model will not be adequate. One can deduce that most scatter plot pairs will not have linear characteristics. However, a quantile regression model will generally resolve the inadequacy of OLS models.

The Base Model Formula:

$$ Q_{y_i}(\tau \mid \mathbf{x}_i) = \mathbf{x}_i^\top \boldsymbol{\beta}(\tau) + \epsilon $$

Advantages of This Approach:

  1. Applying a quantile regression model as the base model instead of a complex NWP model reduces computational complexity. The simplest NWP models are the Barotropic model and the Baroclinical model. Such two models don't account for the target of interest. As well, data for the attributes of such two models can be quite elusive and tedious to wrangle into meaningful measurements. Additionally, the boundary conditions, appropriate parameters, relevant time scale and computational complexity are serious concerns; the ability to find a decent fit for montserrat can be extremely challenging. A regression model directly applies weather data generally without temporal considerations.

  2. The random forest MOS model is capable of learning complex, nonlinear relationships in the errors of the base regression model, improving the overall forecast accuracy. This "adopted scheme" is easily scalable to forecast other weather variables without abstract mathematical physics equations.

  3. In the MOS random forest setup the forest adapts to the coefficients of the regression model based on different weather situations (e.g., different temperature ranges or pressure levels, etc.).

Feature Selection¶

The target or response variable of concern is rain (fall), measured in millimeters. Recognising Meteorology or Climatology as serious fields of sustainable professional development the base model must be at least respectable concerning predictors or features. Hence, will apply feature selection as a preliminary step.

In [247]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# List of targets
targets = ['rain']

# Loop through each target
for target in targets:
    print(f"\n{'='*60}\nAnalyzing Target: {target}\n{'='*60}")

    # Define features: drop current target from targets list + use all other columns
    possible_features = MOS_data.drop(columns=['rain', 'date'])

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        possible_features, 
        MOS_data[target], 
        test_size=0.2, 
        random_state=42
    )

    # Initialize model
    rf_model = RandomForestRegressor(n_estimators=50, random_state=42)

    # Fit model
    rf_model.fit(X_train, y_train)

    # Feature importances
    importances = rf_model.feature_importances_
    feature_importances = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)

    # Plot feature importances
    plt.figure(figsize=(12, 6))
    plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
    plt.xlabel('Importance')
    plt.title(f'Feature Importances for Target: {target}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

    # Print ranked features
    print("Ranked Features based on Importance:")
    print(feature_importances)

    # Recursive Feature Elimination
    rfe = RFE(estimator=rf_model, n_features_to_select=5)
    rfe.fit(X_train, y_train)
    selected_features = X_train.columns[rfe.support_]

    print("Selected Features by RFE:")
    print(selected_features.tolist())
============================================================
Analyzing Target: rain
============================================================
No description has been provided for this image
Ranked Features based on Importance:
                      Feature  Importance
2             wind_speed_100m    0.144701
9             cloud_cover_low    0.126407
4        relative_humidity_2m    0.116968
7     vapour_pressure_deficit    0.107037
10            cloud_cover_mid    0.094817
6            surface_pressure    0.092521
1              wind_speed_10m    0.065555
8       boundary_layer_height    0.059934
3                pressure_msl    0.049230
12  diffuse_radiation_instant    0.036907
5                dew_point_2m    0.035584
11           cloud_cover_high    0.035510
0              temperature_2m    0.034828
Selected Features by RFE:
['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure', 'vapour_pressure_deficit', 'cloud_cover_low']

Based on the feature selection operation and the correlation heat map from earlier, the above result is acceptable.

To now build a inexplicit base model and gauge its performance with unobserved data.

In [250]:
import pandas as pd
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_absolute_error, mean_pinball_loss, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import numpy as np

# Define quantiles
quantiles = [0.25, 0.5, 0.75, 0.9]
models = {}
predictions = {}

# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m',
              'surface_pressure', 'vapour_pressure_deficit',
              'cloud_cover_low']]
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train QuantileRegressor for each quantile
for q in quantiles:
    model = QuantileRegressor(quantile=q, alpha=0, solver='highs')
    model.fit(X_train, y_train)
    models[q] = model
    predictions[q] = model.predict(X_test)

# Median prediction
y_pred_median = predictions[0.5]

# 1. Regression Evaluation
mae = mean_absolute_error(y_test, y_pred_median)
pinball = mean_pinball_loss(y_test, y_pred_median, alpha=0.5)

print(f"--- Regression Evaluation (Median model) ---")
print(f"MAE: {mae:.3f}")
print(f"Pinball Loss (q=0.5): {pinball:.3f}")

# 2a. Classification Evaluation — Option 1: Lower percentile threshold (40th)
threshold_40 = np.percentile(y_train, 40)

y_class_true_40 = (y_test > threshold_40).astype(int)
y_class_pred_40 = (y_pred_median > threshold_40).astype(int)

acc_40 = accuracy_score(y_class_true_40, y_class_pred_40)
cm_40 = confusion_matrix(y_class_true_40, y_class_pred_40)

print(f"\n--- Classification Evaluation (Threshold = 40th percentile of y_train) ---")
print(f"Threshold value: {threshold_40:.3f}")
print(f"Accuracy: {acc_40:.3f}")
print("Confusion Matrix:")
print(cm_40)

# 2b. Classification Evaluation — Option 2: Model's predicted median threshold
threshold_pred = np.median(y_pred_median)

y_class_true_pred = (y_test > threshold_pred).astype(int)
y_class_pred_pred = (y_pred_median > threshold_pred).astype(int)

acc_pred = accuracy_score(y_class_true_pred, y_class_pred_pred)
cm_pred = confusion_matrix(y_class_true_pred, y_class_pred_pred)

print(f"\n--- Classification Evaluation (Threshold = median of predicted values) ---")
print(f"Threshold value: {threshold_pred:.3f}")
print(f"Accuracy: {acc_pred:.3f}")
print("Confusion Matrix:")
print(cm_pred)
--- Regression Evaluation (Median model) ---
MAE: 0.089
Pinball Loss (q=0.5): 0.044

--- Classification Evaluation (Threshold = 40th percentile of y_train) ---
Threshold value: 0.000
Accuracy: 0.749
Confusion Matrix:
[[3891    0]
 [1302    0]]

--- Classification Evaluation (Threshold = median of predicted values) ---
Threshold value: 0.000
Accuracy: 0.749
Confusion Matrix:
[[3891    0]
 [1302    0]]

Manually provide counts...

TN, FP, FN, TP = 3891, 1302, 0, 0

METRICS: ACCURACY = (TP + TN) / (TP + TN + FP + FN)

PRECISION = TP / (TP + FP) # How many predicted 1s are correct

RECALL = TP / (TP + FN) # How many actual 1s were captured

F1 = 2 * precision * recall / (precision + recall)

Accuracy: 0.749 74.9% of test samples are classified as "not above median", but...

Precision undefined (division by zero) Because no positives were predicted

Recall 0.000 Model predicted no positives — it missed all actual positive rainfall cases

F1 Score 0.000 No balance between precision and recall — this is a dead classifier

Of consequence, will revert to a model without feature selection.

In [253]:
# Identifying the features and the target
X = MOS_data.drop(columns = ['rain', 'date'])
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define quantiles
quantiles = [0.25, 0.5, 0.75, 0.95]
models = {}
predictions = {}

# 1. Train QuantileRegressor for each quantile
for q in quantiles:
    model = QuantileRegressor(quantile=q, alpha=0, solver='highs')
    model.fit(X_train, y_train)
    models[q] = model
    predictions[q] = model.predict(X_test)

# 2. Regression Evaluation (Example: for the median model)
y_pred_median = predictions[0.5]

mae = mean_absolute_error(y_test, y_pred_median)
pinball = mean_pinball_loss(y_test, y_pred_median, alpha=0.5)

print(f"--- Regression Evaluation (Median model) ---")
print(f"MAE: {mae:.3f}")
print(f"Pinball Loss (q=0.5): {pinball:.3f}")

# 3. Classification-style Evaluation
# Example: classify if true target is above or below the predicted median
# This mimics a binary classifier

y_class_true = (y_test > np.median(y_train)).astype(int)  # True: above historical median
y_class_pred = (y_pred_median > np.median(y_train)).astype(int)

acc = accuracy_score(y_class_true, y_class_pred)
cm = confusion_matrix(y_class_true, y_class_pred)

print(f"\n--- Classification Evaluation (based on 50th percentile threshold) ---")
print(f"Accuracy: {acc:.3f}")
print("Confusion Matrix:")
print(cm)
--- Regression Evaluation (Median model) ---
MAE: 0.088
Pinball Loss (q=0.5): 0.044

--- Classification Evaluation (based on 50th percentile threshold) ---
Accuracy: 0.507
Confusion Matrix:
[[1347 2544]
 [  17 1285]]
           Predicted
          |   0   |   1   |
    ------------------------
True  0  | 1347  | 2544  |
True  1  |   17  | 1285  |

ANALYSIS --

Accuracy (50.7%) Barely better than flipping a coin — suggests the model is misclassifying a large number of observations.

Precision (33.6%) Only 1 in 3 predicted positives is actually correct. High false positive rate.

Recall (98.7%) Nearly all actual positives are correctly identified — very few false negatives.

F1 Score (50.1%) A moderate harmonic balance between precision and recall. Weighted toward recall due to high imbalance.

Strength -- High Recall (98.7%): The model is excellent at capturing actual positive cases (e.g., identifying risky, extreme, or high-priority instances).

Weakness -- Very Low Precision (33.6%): Most of the predicted positives are actually false. That’s a high false alarm rate.

This model behaves like a “better-safe-than-sorry” classifier:

1. It labels almost everything potentially risky (or “positive”).

2. Almost never misses a real positive, but triggers a lot of unnecessary alarms.

This is good if:

1. False negatives are dangerous/costly

E.g., detecting:

        Floods

It’s problematic if:

        False positives are expensive or disruptive

        E.g., costly interventions, user alerts, wasted inspections, etc.

Of consequence, to proceed with a quantile regression model, but now to develop an explicit model...

In [256]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Identifying the features and the target
X = MOS_data.drop(columns = ['rain', 'date'])
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify clearly the the model coefficients 
import statsmodels.api as sm

# Standardization scales features so that they have a mean of 0 and a standard deviation of 1.
# Multiciollinearity is not a serious issue based on observation of the Pearson correlation matrix;
        # All highly correlated pairs are resolved by dropping feat with lower feature importance. 
# There are no serious near-linear dependencies in the predictors from observation of the Pearson correlation matrix.
# The meteorological features have considerable different scales however, so to standardize.


# Add constant (intercept)
X_with_const = sm.add_constant(X_train)

# Fit quantile regression model at quantile 0.5 (median)
quantile = 0.5
sm_model = sm.QuantReg(y_train, X_with_const)
result = sm_model.fit(q=quantile)

# Print summary with variable names
print(result.summary())
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:                   rain   Pseudo R-squared:              0.01618
Model:                       QuantReg   Bandwidth:                    0.002483
Method:                 Least Squares   Sparsity:                      0.02279
Date:                Fri, 27 Jun 2025   No. Observations:                20772
Time:                        23:57:29   Df Residuals:                    20758
                                        Df Model:                           13
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                         1.3545      0.238      5.702      0.000       0.889       1.820
temperature_2m               -0.0542      0.009     -5.721      0.000      -0.073      -0.036
wind_speed_10m                0.0002   6.17e-05      3.586      0.000       0.000       0.000
wind_speed_100m              -0.0002   5.29e-05     -3.883      0.000      -0.000      -0.000
pressure_msl                 -0.4430      0.076     -5.821      0.000      -0.592      -0.294
relative_humidity_2m          0.0011      0.000      6.527      0.000       0.001       0.001
dew_point_2m                 -0.0004      0.001     -0.534      0.594      -0.002       0.001
surface_pressure              0.4588      0.079      5.820      0.000       0.304       0.613
vapour_pressure_deficit       0.0327      0.005      6.990      0.000       0.023       0.042
boundary_layer_height     -1.475e-06   6.73e-07     -2.194      0.028   -2.79e-06   -1.57e-07
cloud_cover_low               0.0002    4.8e-06     38.690      0.000       0.000       0.000
cloud_cover_mid               0.0019   4.84e-06    393.481      0.000       0.002       0.002
cloud_cover_high           2.031e-06   2.11e-06      0.964      0.335    -2.1e-06    6.16e-06
diffuse_radiation_instant  1.956e-05   1.14e-06     17.149      0.000    1.73e-05    2.18e-05
=============================================================================================

The condition number is large, 5.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\Users\verlene\anaconda3\Lib\site-packages\statsmodels\regression\quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
  warnings.warn("Maximum number of iterations (" + str(max_iter) +

Dropping the features with poor p-values.

In [257]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Identifying the features and the target
X = MOS_data.drop(columns = ['rain', 'date',
                             'dew_point_2m', 'cloud_cover_high',
                             'pressure_msl'])
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify clearly the the model coefficients 
import statsmodels.api as sm


# Add constant (intercept)
X_with_const = sm.add_constant(X_train)

# Fit quantile regression model at quantile 0.5 (median)
quantile = 0.5
sm_model = sm.QuantReg(y_train, X_with_const)
result = sm_model.fit(q=quantile)

# Print summary with variable names
print(result.summary())
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:                   rain   Pseudo R-squared:              0.01610
Model:                       QuantReg   Bandwidth:                    0.002377
Method:                 Least Squares   Sparsity:                      0.02295
Date:                Fri, 27 Jun 2025   No. Observations:                20772
Time:                        23:57:29   Df Residuals:                    20761
                                        Df Model:                           10
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                        -0.0103      0.044     -0.236      0.814      -0.096       0.075
temperature_2m               -0.0012      0.000     -6.498      0.000      -0.002      -0.001
wind_speed_10m                0.0003    6.1e-05      4.114      0.000       0.000       0.000
wind_speed_100m              -0.0002   5.25e-05     -4.300      0.000      -0.000      -0.000
relative_humidity_2m          0.0008      0.000      6.605      0.000       0.001       0.001
surface_pressure          -4.139e-05   4.39e-05     -0.944      0.345      -0.000    4.46e-05
vapour_pressure_deficit       0.0253      0.004      7.091      0.000       0.018       0.032
boundary_layer_height     -1.973e-06   6.66e-07     -2.963      0.003   -3.28e-06   -6.68e-07
cloud_cover_low               0.0002   4.77e-06     39.313      0.000       0.000       0.000
cloud_cover_mid               0.0019   4.86e-06    390.480      0.000       0.002       0.002
diffuse_radiation_instant  1.956e-05   1.14e-06     17.234      0.000    1.73e-05    2.18e-05
=============================================================================================

The condition number is large, 6.93e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Will now build a model based on feature selection from earlier...

In [259]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure',
                             'vapour_pressure_deficit', 'cloud_cover_low']]
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify clearly the the model coefficients 
import statsmodels.api as sm


# Add constant (intercept)
X_with_const = sm.add_constant(X_train)

# Fit quantile regression model at quantile 0.5 (median)
quantile = 0.5
sm_model = sm.QuantReg(y_train, X_with_const)
result = sm_model.fit(q=quantile)

# Print summary with variable names
print(result.summary())
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:                   rain   Pseudo R-squared:           -1.102e-06
Model:                       QuantReg   Bandwidth:                     0.01324
Method:                 Least Squares   Sparsity:                      0.02388
Date:                Fri, 27 Jun 2025   No. Observations:                20772
Time:                        23:57:30   Df Residuals:                    20766
                                        Df Model:                            5
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                    -5.36e-07      0.044  -1.22e-05      1.000      -0.086       0.086
wind_speed_100m         -4.012e-09   9.69e-06     -0.000      1.000    -1.9e-05     1.9e-05
relative_humidity_2m     5.062e-08   3.88e-05      0.001      0.999    -7.6e-05    7.61e-05
surface_pressure        -4.344e-09    4.4e-05  -9.87e-05      1.000   -8.62e-05    8.62e-05
vapour_pressure_deficit  1.081e-06      0.001      0.001      0.999      -0.002       0.002
cloud_cover_low           6.18e-08   4.71e-06      0.013      0.990   -9.17e-06    9.29e-06
===========================================================================================

The condition number is large, 5.23e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

NOTE: A low pseudo R² doesn't always mean the model is bad — especially for complex, noisy phenomena like precipitation concerning hourly increments. Rainfall is highly stochastic — hard to model with high R².

In [261]:
rain_data = MOS_data[['date', 'rain']]
rain_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25965 entries, 0 to 30332
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   date    25965 non-null  datetime64[ns, UTC]
 1   rain    25965 non-null  float32            
dtypes: datetime64[ns, UTC](1), float32(1)
memory usage: 507.1 KB

Some basic time series analysis or decomposition.

In [262]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm

# Set the 'date' column as the DataFrame index
rain_data = rain_data.set_index('date')

# Ensure the data is sorted by date
rain_data = rain_data.sort_index()

# For time series analysis, it's often good practice to resample to a regular frequency.
# Here, we resample to a daily frequency and fill any missing values with 0.
# You might choose a different resampling frequency (e.g., 'W' for weekly, 'M' for monthly)
# depending on the nature of your rain data and the seasonality you expect.
rain_data_hourly = rain_data['rain'].resample('h').sum().fillna(0)

# --- Time Series Decomposition ---
# We use seasonal_decompose to break down the time series into trend, seasonal, and residual components.
# 'model' can be 'additive' or 'multiplicative'. 'additive' is suitable when the
# seasonal fluctuations are roughly constant over time. 'multiplicative' is for
# when they change proportionally to the level of the series.
# 'period' is the number of observations in a cycle. For hourly data with weekly seasonality, period=24.
# Adjust 'period' based on your data's primary seasonality (e.g., 12 for monthly data with yearly seasonality).
try:
    decomposition = seasonal_decompose(rain_data_hourly, model='additive', period=24)

    # Plotting the decomposition
    fig, axes = plt.subplots(4, 1, figsize=(12, 10), sharex=True)

    axes[0].plot(decomposition.observed)
    axes[0].set_ylabel('Observed')
    axes[0].set_title('Time Series Decomposition of Rain Data')

    axes[1].plot(decomposition.trend)
    axes[1].set_ylabel('Trend')

    axes[2].plot(decomposition.seasonal)
    axes[2].set_ylabel('Seasonal')

    axes[3].plot(decomposition.resid)
    axes[3].set_ylabel('Residual')
    axes[3].set_xlabel('Date')

    plt.tight_layout(rect=[0, 0.03, 1, 0.96]) # Adjust layout to prevent title overlap
    plt.show()

except Exception as e:
    print(f"Error during time series decomposition: {e}")
    print("Please check if your time series data has enough observations for the chosen 'period'.")
    print("For example, if period=7, you need at least 14 data points for decomposition to work well.")


# --- LOWESS (Locally Weighted Scatterplot Smoothing) ---
# LOWESS is a non-parametric regression method that fits a series of local linear regressions
# to smooth a scatter plot. It's great for visualizing the trend in noisy data.

# 'frac' parameter: controls the smoothness. It's the fraction of data used when estimating
# each local regression. Smaller frac = less smooth, larger frac = more smooth.
# Typical values are between 0.1 and 0.8. Adjust based on how much smoothing you need.
lowess_smoothed = sm.nonparametric.lowess(rain_data_hourly.values, rain_data_hourly.index.astype(np.int64), frac=0.1)

# Convert the output back to a DataFrame with datetime index for easier plotting
lowess_df = pd.DataFrame(lowess_smoothed, columns=['date_int', 'smoothed_rain'])
lowess_df['date'] = pd.to_datetime(lowess_df['date_int'])
lowess_df = lowess_df.set_index('date')
lowess_df = lowess_df.sort_index()

# Plotting LOWESS smoothing
plt.figure(figsize=(12, 6))
plt.plot(rain_data_hourly.index, rain_data_hourly, label='Original Rain Data', alpha=0.7)
plt.plot(lowess_df.index, lowess_df['smoothed_rain'], color='red', linewidth=2, label='LOWESS Smoothed')
plt.title('LOWESS Smoothing of Rain Data')
plt.xlabel('Date')
plt.ylabel('Rain')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image

Based on seasonal component the data seems to genuinely lacks daily seasonality. The hourly values of 'rain' don’t show a consistent pattern each day.

Tests For Seasonality¶

Will pursue seasonality check by spectral analysis with periodogram, and provide mention on the Ljung-Box test.

Spectral Analysis or Periodogram¶

Spectral analysis, particularly using the periodogram, is based on the Fourier transform of a time series. It quantifies how the variance (power) of a signal is distributed across different frequencies. The mathematical structure and intuition behind it:

1. Signal Orientation

Let $x(t)$ be a real-valued time series sampled at regular intervals, where:

$ t = 1, 2,..., N - 1 $ (discrete time steps);

Sample interval is $ \Delta\,t$ (1 hour);

Total duration being, $ T = N \times \Delta\,t $

2. Discrete Fourier Transform

Let $x(t)$ be a discrete time series with $N$ samples. The Discrete Fourier Transform (DFT) is defined as:

$$X(f_k) = \sum_{t=0}^{N-1} x(t) \cdot e^{-2\pi i f_k t}$$

where the discrete frequency $( f_k)$ is given by:

$$f_k = \frac{k}{N \Delta t}, \quad \text{for } k = 0, 1, 2, \dots, N-1 $$

The corresponding periodogram (power spectral density estimate) is:

$$P(f_k) = \frac{1}{N} \left| X(f_k) \right|^2$$

3. Periodogram Definition

The periodogram estimates the power spectral density (PSD) of $x(t)$ as:

$$P(f_k) = \frac{1}{N}\,\,\left|X(f_k)\right|^2$$

This represents how the power (variance) of the time series is distributed across frequencies $f_k$

Units of $P(f_k)$ depend on the units of $x(t)$; often variance per unit frerequency.

For real-valued time series, the periodogram is symmetric around the Nyquist frequency $f = \frac{1}{2\Delta\,t}$.

4. Interpretation of Peaks

A peak in $P(f_k)$ indicates a dominant cycle of period $T_k = \frac{1}{f_k}$.

Example: a strong peak at $f_k = \frac{1}{24}$ (cycles/hour) corresponds to a 24-hour cycle.

5. Continuous-Time Analogy

In continuous time, the Power Spectral Density (PSD), $S(f)$ is defined via the Weiner-Khinchin theorem:

$$S(f) = \int_{-\infty}^{\infty} R(\tau) \, e^{-2\pi i f \tau} \, d\tau$$

where $R(\tau) = \mathbb{E}[x(t) x(t+\tau)]$ is the autocorrelation function of the process $x(t)$, and $f$ is the frequency (in Hz).

So, spectral analysis or periodogram concerns testing for hidden cycles or periodic structure.

  1. A flat power spectrum (white noise) supports randomness.

  2. Dominant Peaks suggest non-random periodic behavior.

In [266]:
import numpy as np
from scipy.signal import periodogram
import matplotlib.pyplot as plt

# Acquiring a time and attribute domain or dataframe
rain_series_test = MOS_data[['date', 'rain']]

# Extract rain values as 1D array
rain_series = rain_series_test['rain'].values

# Set sampling frequency: 1/hour (since data is hourly)
fs = 1  # samples per hour

# Compute periodogram
f, Pxx = periodogram(rain_series, fs=fs)

# Exclude zero frequency (DC component)
f = f[1:]
Pxx = Pxx[1:]

# Convert frequency to period (hours and days)
period_hours = 1 / f
period_days = period_hours / 24

# Plot periodogram with period axis (in days)
plt.figure(figsize=(14, 6))
plt.semilogy(period_days, Pxx)
plt.title("Periodogram of Hourly Rainfall in Montserrat")
plt.xlabel("Period (Days)")
plt.ylabel("Power")
plt.grid(True)
plt.xscale('log')
plt.axvline(1, color='red', linestyle='--', label='Daily cycle (1d)')
plt.axvline(7, color='green', linestyle='--', label='Weekly cycle (7d)')
plt.axvline(30, color='orange', linestyle='--', label='Monthly cycle (30d)')
plt.axvline(180, color='purple', linestyle='--', label='Seasonal cycle (180d)')
plt.legend()
plt.show()
No description has been provided for this image

NOTE: for peaks that greatly stand out they are designated as strong peaks, to identify observed cycles w.r.t. to the associated time. From observation, peaks appear to sluggishly become more identifiable when approaching the seasonal duration. However, there are other peaks comparable to the one that's well aligned with the seasonal cycle point. such sluggish strong peak development isn't impressive to recognise any definitive cycles.

Ljung-Box Test¶

The Ljung-Box test is a statistical test designed to detect whether a time series exhibits significant autocorrelation at lags up to a specified maximum lag $h$. It is widely used to evaluate whether residuals from a time series model resemble white noise. The test improves upon the Box-Pierce statistic by applying a small-sample correction.

  1. NULL AND ALTERNATIVE HYPOTHESES --

$H_0$: the data are independent (no autocorrelation up to a lag $h$)

$H_a$: The data are not independent (at least one autocorrelation is non-zero up to lag $h$)

  1. TEST STATISTIC --

Have:

$n$: number of observations

$h$: number of lags tested

$r_k$: sample autocorrelation at lag $k$

Then the Ljung-Box test statistic is:

$$ Q = n(n+2) \sum_{k=1}^{h} \frac{r_k^2}{n - k} $$
  1. DISTRIBUTION --

Under $H_0$, the test statistic approximately follows a chi-squared distribution with $h$ degrees of freedom:

$$ Q \sim \chi^2(h) $$

The corresponding p-value is computed as:abs

$$p = \mathbb{P}(\chi^2_h > Q) $$
  1. INTERPRETATION --

If $p < \alpha$ (e.g., 0.05), reject $H_0 $: Evidence of autocorrelation.

If $p \geq \alpha$, fail to reject $H_0$: No significant autocorrelation detected.

  1. APPLICATIONS --

When applied to residuals from a fitted model, it checks for model adequacy.

When applied to raw time series data, a significant result (autocorrelation) at seasonal lags may suggest seasonality, but is not conclusive on its own.

Ljung-Box Test Suggesting Seasonality¶

The case of interest:

  1. Downsample to daily data,

  2. Run Ljung-Box test at lag = 365 (1 year),

  3. Getting a significant p-value,

  4. Observing a peak in ACF at lag 365,

Then it supports seasonality at yearly frequency. Namely, it suggests seasonality, but does not provide confirmation. The programming for such:

NOTE: the beneath code is placed in commentary form because autocorrelation related computations can be quite computationally expensive; hourly observations ranging from 2022 to 2025 are applied.

In [313]:
# from statsmodels.stats.diagnostic import acorr_ljungbox
       # Resample to daily rainfall totals
# rain_daily = rain_series.resample('D').sum()
      # Such above reduces noise and lets you test for annual cycles more feasibly.
# results = acorr_ljungbox(rain_daily, lags=[365], return_df=True)
      # The above line to test at annual lag (daily data)
# print(results)
      # If p-value < 0.05, this suggests autocorrelation at the yearly lag.

NOTE: earlier, the seasonal component of the times series was observed, which suggested possible sporadic behaviour. As well, spectral analysis or periodiogram was developed earlier, which conveyed a sluggish increase of unimpressive peaks. Then, running the Ljung-Box Test, and if a p-value < 0.05 is acquired, such would further strengthen the position of declaring sporadic or random riain fall behaviour.

Nevertheless, to continue with MOS development based on feature selection.

Mathematical Structure for the MOS Random Forest Model¶

The adopted base model is a quantile linear regression to predict the target variable $y$ based in input features $X$.

1. Multivariate Quantile Regression Model as the Base Model:

Input Features: $X = [X_1\,X_2\,...,\,X_n]$, where $X_i$ is the $i$-th feature.

Coefficients: $\beta = [\beta_1\,\beta_2\,...,\,\beta_n]$, representing the relationship between each feature and the target.

Prediction Function:

$$Q_y(\tau|X) = \beta_0 + \sum_{i=1}^n \beta_i X_i$$

where:

$Q_y(\tau|X)$ is the conditional $\tau$-quantile of $y$ w.r.t. $X$ and $\beta$ as the base model.

$\beta_0$ is the intercept.

The residuals (errors) are computed as:

$$r_{\text{train}} = y_{\text{train}} - \hat{Q_y}(\text{train})$$$$r_{\text{test}} = y_{\text{test}} - \hat{Q_y}(\text{test})$$

$y_{\text{train}}$ and $y_{\text{test}}$ are the actual observed values for the target training and test data, respectively.

$\hat{Q_y}(\text{train})$ and $\hat{Q_y}(\text{test})$ are the predictions from the base Quantile regression model.

2. Residual Modeling with Random Forest (MOS Model):

To rectify for errors made by the base model, a random forest is trained on the residuals of the base model predictions. The notion is that random forest can capture non-linearities and complex interactions between the features that the regression model will not account for.

Random Forest Model for Residuals:

The input to the random forest model is stil the same feature set $X$, however, the target is now the residuals from the base model:

$$\hat{r} = f_{\text{RF}}(X)$$

where:

$f_{\text{RF}}$ is the random forest function trained to predict residuals $r_{\text{train}}$ on the training data.

Such above model learns a non-linear mapping between the features and the residuals.

3. Final Corrected Prediction:

The final corrected prediction is acquired by addition the residual corrections from the random forest model to the base predictions.

Final Prediction:

$$\hat{Qy}_{\text{final}} = \hat{Qy}_{\text{base}} + \hat{r}$$

where:

$\hat{y}_final$ is the final prediction (with corrections);

$\hat{y}_{\text{base}}$ is the prediction from the base model;

$\hat{r}$ is the correction (residual prediction) from the random forest model.

For the test set, such becomes:

$$\hat{Qy}_{\text{final},\,\text{test}} = \hat{Qy}_{\text{base},\,\text{test}} + \hat{r}_{\text{test}}$$

4. Mean Squared Error (MSE) for Model Evaluation:

Model performance is evaluated using MSE, measuring the average squared distance between the actual values and the predicted values.

Base Model MSE:

$$\text{MSE}_{\text{base}} = \frac{1}{m} \sum_{i=1}^m \left(y_i - \hat{y}_{\text{base}} \right)^2$$

MOS Model (Corrected) MSE:

$$\text{MSE}_{\text{MOS}} = \frac{1}{m} \sum_{i=1}^m \left(y_i - \hat{y}_{\text{final}} \right)^2$$

5. Feature Importance in Random Forest:

The feature importance from the random forest is a measure of how much each feature contributes to reducing the variance of the residuals:

Feature Importance Score: $I(X_i)$ representing how much the feature $X_i$ reduces the model's error. Such can be observed in a bar plot to comprehend which features are the most important for predicting the residuals (namely, the errors the base model missed).

6. Forecasting Using MOS:

Base Forecast:

$$\hat{Qy}_{\text{base}} = \beta_0 + \sum_{i=1}^n \beta_i X_i$$

MOS Corrected Forecast:

$$y_{\text{MOS corrected}} = \hat{Qy}_{\text{base}} + f_{\text{RF}}(X)$$

Such formulation conveys how the multivariate linear regression model forms the basis of prediction, while the random forest provides a second layer of refinement by capturing non-linear behaviors. The overall approach amplifies prediction accuracy by combining the strengths of both models.

In [280]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import QuantileRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure',
                             'vapour_pressure_deficit', 'cloud_cover_low']]
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Base (Quantile Regression) Model
base_model = QuantileRegressor(quantile = 0.5) # for median
base_model.fit(X_train, y_train)

# Generate base predictions on the training and test data
base_train_preds = base_model.predict(X_train)
base_test_preds = base_model.predict(X_test)

# Establish residuals (errors) between actual values and base model predictions 
train_residuals = y_train - base_train_preds
test_residuals = y_test - base_test_preds

# Train the random forest MOS model to predict the residuals
rf_mos = RandomForestRegressor(n_estimators= 100, random_state=42)
rf_mos.fit(X_train, train_residuals)

# NOTE: the test sets should be viewed as new data. 

# Predict the residuals on the test data using the random forest MOS model
mos_residual_corrections = rf_mos.predict(X_test)

# Final corrected predictions = base model predictions + MOS corrections
final_predictions = base_test_preds + mos_residual_corrections

# Evaluate the model using mean squared error (MSE)
base_mse = mean_squared_error(y_test, base_test_preds)
mos_mse = mean_squared_error(y_test, final_predictions)

print(f"Base Regression Model MSE: {base_mse}")
print(f"MOS Random Forest Model Corrected MSE: {mos_mse}")


# Step 11: Visualize Feature Importance for the Random Forest Model
def visualize_feature_importance(model, X):
    if hasattr(model, 'feature_importances_'):
        feature_importance = model.feature_importances_
        features = X.columns
        plt.figure(figsize=(10, 6))
        plt.barh(features, feature_importance, color='skyblue')
        plt.xlabel("Importance")
        plt.ylabel("Features")
        plt.title("Feature Importances from MOS Random Forest Model")
        plt.show()
    else:
        print("The model does not provide feature importances.")

# Visualize feature importance for the Random Forest model
visualize_feature_importance(rf_mos, pd.DataFrame(X_train, columns = X.columns))

# Forecast Using the MOS Model
# Using test data (X_test) as new data for forecasting

# Base forecast using the regression model
base_forecast = base_model.predict(X_test) 

# MOS corrections using the random forest model
mos_corrections = rf_mos.predict(X_test)

# Final forecast (Base forecast + MOS corrections)
final_forecast = base_forecast + mos_corrections

print("Comparing Observed and Final Forecast:")
# Combine into a DataFrame
df = pd.DataFrame({'Observed Data': y_test, 'Final Forecast': final_forecast})
df
Base Regression Model MSE: 0.15434815878494398
MOS Random Forest Model Corrected MSE: 0.1231500823204265
No description has been provided for this image
Comparing Observed and Final Forecast:
Out[280]:
Observed Data Final Forecast
26331 0.1 0.479
2533 0.9 0.191
2929 0.0 0.108
13114 1.1 0.035
6835 0.0 0.006
... ... ...
433 0.1 0.063
30261 0.0 0.002
24617 0.0 0.076
5836 0.0 0.072
22791 0.0 0.000

5193 rows × 2 columns

Ensemble Forecast Models¶

Weather forecasting, once a realm of educated guesswork, has evolved into a complex science aided by powerful computational tools. Among these, ensemble forecast models have emerged as indispensable instruments for predicting weather patterns with greater accuracy and uncertainty quantification.

Ensemble forecasting is a statistical method that involves running multiple simulations of a weather model with slightly different initial conditions and/or model parameters. This approach recognizes the inherent uncertainty in weather prediction, arising from the chaotic nature of atmospheric dynamics and the limitations of observation networks. By generating a range of possible outcomes, ensemble models provide a more comprehensive picture of the potential weather scenarios, allowing forecasters to assess the likelihood of various events and communicate uncertainty effectively.

The key components of an ensemble forecast model include:

  1. Initial Conditions: These are the starting points for each simulation, derived from observations of atmospheric variables like temperature, pressure, humidity, and wind speed at different locations.
  2. Model Physics: The underlying equations that describe the physical processes governing atmospheric behavior, such as advection, convection, radiation, and precipitation.
  3. Perturbations: Small variations introduced to the initial conditions and/or model parameters to create different ensemble members.
  4. Ensemble Size: The number of individual simulations within the ensemble. Larger ensembles generally provide better statistical representation of uncertainty.

Ensemble models offer several advantages over traditional single-run forecasts:

  1. Uncertainty Quantification: By generating a range of possible outcomes, ensemble models provide a measure of the forecast's reliability. This helps forecasters communicate uncertainty effectively to the public and decision-makers.
  2. Improved Skill: Ensemble forecasts often exhibit better skill than single-run forecasts, especially for rare or extreme events. This is because they can capture the variability associated with such events more accurately.
  3. Early Warning: Ensemble models can provide early warnings of potential severe weather events, allowing for timely preparation and mitigation measures.
  4. Climate Applications: Ensemble models are used to study climate variability and change, providing insights into long-term trends and potential impacts.

However, ensemble forecasting is not without its challenges. One limitation is the computational cost associated with running multiple simulations. As models become more complex and the number of ensemble members increases, the computational requirements can be substantial. Additionally, the effectiveness of ensemble forecasts depends on the quality of the initial conditions and the accuracy of the model physics. Errors in either of these can lead to degraded forecast performance.

Despite these challenges, ensemble forecast models have become an essential tool for modern weather prediction. By providing a more comprehensive and probabilistic view of the weather, they help forecasters make informed decisions and communicate uncertainty effectively to the public. As computational capabilities continue to advance, we can expect further improvements in ensemble forecasting, leading to even more accurate and reliable weather predictions.

Various literature (Warner 2010; Muschinski et al 2023) have provided foundations and modelling for the implementation of weather forecast ensemble models. Now, to demonstrate "kindergarten" level development of ensemble models.

An Ensemble Model Based Purely on Random Forests¶

Due to limitations involving the scope of model physics concerning appropriate parameters, boundary conditions, time orientation, and also having reluctance to delve into computational complexity, a pure machine learning environment will now be adopted. Additionally, basic regression models (multi linear or quantile regression) don't seem to be highly suited to the data applied. To now observe the performance of a random forest model by itself.

Data Preparation¶

The goal is to prepare a dataset for the ensemble weather forecast model representating, representing historical weather data as an array. In a professional and constructive environment data structure technologies have been well established. In the Python language environment such concerns incorporation of the NumPy and Pandas libraries for computation and manipulation of data. The idea of an array or matrix considered:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$

where $x_{ij}$ represents the $j$-th weather variable (such as temperature, humidity, etc.) at the $i$-th time step. So, columns represent variables (features or predictors), and rows represent time steps or observations through time; datetime format is a standard convention. To strictly identify the configuration as an array rather than a matrix for linear operations because linear relationships between the features are not strongly observed, recalling the Pearson correlation heat map.

Model Development¶

Introduce predictive models $f$ to learn about the residing relationships between the target of interest and the applied features. If $X$ represents input features and $y$ represents the target, then:

$$\hat{y} = f(X, \theta)$$

where $\hat{y}$ is the predicted output, and $\theta$ are the model parameters optimized during training. Supurvised models such as regression and ensemble models such as random forests are trained.

In [283]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import QuantileRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure',
                             'vapour_pressure_deficit', 'cloud_cover_low']]
y = MOS_data['rain']  # Target as 1D array

# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    'quantile_regression': QuantileRegressor(),
    'random_forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Train models
for name, model in models.items():
    model.fit(X_train, y_train)

Concering the prior development, for random forests, it's an ensemble of decision trees, where, each tree $T_i$ learns a sub-model:

$$\hat{y} = \frac{1}{B}\sum_{i=1}^{B} T_i (X)$$

NOTE: the above is not a linear analytical function.

Generating Ensemble Forecasts¶

Ensemble members are created by pertubing the initial conditions or the features slightly. For a based model $f$:

$$\hat{y}=f(X+\epsilon_i)$$

where $\epsilon_i$ is a small perturbation for ensemble member $i$.

In [286]:
# Create ensemble members
n_members = 10
ensemble_predictions = []

for i in range(n_members):
    # Perturb the test set
    X_test_perturbed = X_test + np.random.normal(0, 0.05, X_test.shape)
    predictions = models['random_forest'].predict(X_test_perturbed)
    ensemble_predictions.append(predictions)

# Convert to numpy array for easier calculations
ensemble_predictions = np.array(ensemble_predictions)

The above script simulates the generation of $n$ ensembles, each identifying a slightly unique atmospheric state by adding noise $\epsilon_i$ to the input features.

Generating Ensemble Forecasts¶

The ensemble mean and spread summarize the ensemble’s central tendency and uncertainty:

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$$$\sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2}$$
In [289]:
ensemble_mean = np.mean(ensemble_predictions, axis=0)
ensemble_spread = np.std(ensemble_predictions, axis=0)

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Observed', color='black')
plt.plot(ensemble_mean, label='Ensemble Mean', color='blue')
plt.fill_between(range(len(y_test)), 
                 ensemble_mean - ensemble_spread, 
                 ensemble_mean + ensemble_spread, 
                 color='blue', alpha=0.3, label='Ensemble Spread')
plt.legend()
plt.title('Ensemble Forecast with Spread')
plt.show()
No description has been provided for this image

Probabilistic Forecast and Evaluation¶

The CRPS (Continuous Ranked Probability Score) evaluates the ensemble forecast by comparing the distribution of ensemble forecasts against the observed value:

$$\text {CRPS} = \int_{-\infty}^{\infty} (F(X) - H(X))^2 \, dX$$

where

$F(X)$ is the cumulative distrbutiuon function (CDF) of the forecast;

$H(X)$ is the CDF of the observed outcome.

In [291]:
print(y_test.values.shape)
print(ensemble_predictions.shape)
(5193,)
(10, 5193)
In [292]:
ensemble_predictions = ensemble_predictions.T
In [293]:
from properscoring import crps_ensemble

# Compute CRPS
crps = crps_ensemble(y_test.values, ensemble_predictions)
print(f'CRPS: {np.mean(crps)}')
CRPS: 0.09997477047952492

For the above, CRPS assesses how well the probabilistic distribution of the ensemble matches the observed value, indicating the accuracy and reliability of the forecast.

If the forecast perfectly predicts the observed outcome $F(X)$ will be identical to $H(X)$ for all values of $X$. In such as case the CRPS will be 0.

Otherwise the CRPS will be higher. Higher CRPS values conveys are larger discrepancy between the predicted and outcome distributions.

Visualization and Analysis¶

The concluding step is to visualize the distribution of the ensemble members to better comprehend their spread and variability:

In [296]:
plt.figure(figsize=(12, 6))
for i in range(n_members):
    sns.kdeplot(ensemble_predictions[i], alpha=0.3, warn_singular = False)
plt.axvline(y_test.values.mean(), color='black', linestyle='--', label='Observed Mean')
plt.title('Density of Ensemble Members')
plt.legend()
plt.show()
No description has been provided for this image

By plotting the KDE (Kernel Density Estimate) of each ensemble member, we visualize how the forecast probabilities are distributed around the observed mean, which provides insights into the uncertainty and reliability of the forecast.

The Intersection of Data Science and Meteorology: A Powerful Partnership¶

The synergy between data science and meteorology has given rise to a new era of weather forecasting and climate analysis. By leveraging techniques such as data processing, statistical programming, data wrangling, exploratory data analysis, time series analysis, and machine learning, researchers and meteorologists are unlocking valuable insights from vast datasets.

Data processing and wrangling form the foundation of this partnership, ensuring that raw meteorological data is cleaned, standardized, and transformed into a usable format. Statistical programming languages like Python and R provide the tools to manipulate, analyze, and visualize this data effectively. Exploratory data analysis helps identify patterns, trends, and anomalies within the data, guiding further investigations.

Time series analysis is particularly crucial for meteorological data, as it often exhibits temporal dependencies. Time series algorithms like Prophet can capture these dependencies and make accurate predictions. Machine learning algorithms, such as local outlier factor, multilinear regression, quantile regression, logistic regression, random forests, and analysis offer powerful tools for modeling complex relationships between meteorological variables. Additionally, extreme value analysis and survival analysis also have meaningful application with meteorological data.

Conclusion¶

This project has demonstrated the potential of data wrangling, exploratory data analysis (EDA), statistical analysis, stochastic models and machine learning to extract valuable insights from historical meteorological data. By employing a range of such tools and techniques there was ability to visualize trends, uncover hidden relationships and characteristics within the data. Such development provided some foundation to explore temporal dependencies, forecast future trends, climate standing, extreme conditions, probability of outcomes, and develop weather prediction models.

The applied data from government agencies, along with Open-Meteo API and the Kaggle repository proved to be valuable resources for this project, offering a vast repository of high-quality historical weather data.

Overall, this project highlights the importance of leveraging advanced programming and analysis techniques to better understand climate data, patterns and improve our ability to predict future weather events. By applying all such knowledge and skills with robust datasets, one can gain valuable insights that can inform decision-making in various fields, such as agriculture, energy, climate preparedness, and disaster management.

References¶

Anderson, G.B., Bell, M.L. and Peng, R.D. (2013). Methods to Calculate the Heat Index as an Exposure Metric in Environmental Health Research. Environ Health Perspect 121:1111–1119; https://doi.org/10.1289/ehp.1206273

Extreme value analysis. Met Office. (n.d.). https://www.metoffice.gov.uk/services/research-consulting/weather-climate-consultancy/extreme-value-analysis

Forecasting at Scale. Prophet. (n.d.). https://facebook.github.io/prophet/

Goel MK, Khanna P, Kishore J. Understanding Survival Analysis: Kaplan-Meier Estimate. Int J Ayurveda Res. 2010 Oct;1(4):274-8. doi: 10.4103/0974-7788.76794. PMID: 21455458; PMCID: PMC3059453.Hamdi, Y., Haigh, I. D., Parey, S., and Wahl, T.: Preface: Advances in Extreme Value analysis and Application to Natural Hazards, Nat. Hazards Earth Syst. Sci., 21, 1461–1465, https://doi.org/10.5194/nhess-21-1461-2021, 2021.

Hayes, A. (2019). How the Wilcoxon Test Is Used. Investopedia. https://www.investopedia.com/terms/w/wilcoxon-test.asp

Hayes, Adam. (2022). What Is a Times Series and How Is It Used to Analyze Data? Investopedia. https://www.investopedia.com/terms/t/timeseries.asp

Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I., Schepers, D., Simmons, A., Soci, C., Dee, D., Thépaut, J-N. (2023). ERA5 hourly data on single levels from 1940 to present [Data set]. ECMWF. https://doi.org/10.24381/cds.adbb2d47

Historical Hurricane Tracks. Climate Mapping for Resilience and Adaptation. (n.d.). https://resilience.climate.gov/datasets/fedmaps::historical-hurricane-tracks/about

Koenker, R., & José A. F. Machado. (1999). Goodness of Fit and Related Inference Processes for Quantile Regression. Journal of the American Statistical Association, 94(448), 1296–1310. https://doi.org/10.2307/2669943

Koenker, R. and Hallock, K. F. (2001). Quantile Regression. Journal of Economic Perspectives—Volume 15, Number 4—Pages 143–156

Localoutlierfactor. scikit. (n.d.). https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

Li, X., Marcus, D. and Russell, J. et al. (2024). Weibull Parametric Model for Survival Analysis in Women with Endometrial Cancer using Clinical and T2-Weighted MRI Radiomic Features. BMC Med Res Methodol 24, 107 (2024). https://doi.org/10.1186/s12874-024-02234-1

MacFarland, T.W., Yates, J.M. (2016). Mann–Whitney U Test . In: Introduction to Nonparametric Statistics for the Biological Sciences Using R. Springer, Cham. https://doi.org/10.1007/978-3-319-30634-6_4

MacKinnon, J.G. 1994 “Approximate Asymptotic Distribution Functions for Unit-Root and Cointegration Tests.” Journal of Business & Economics Statistics, 12.2, 167-76.

MacKinnon, J.G. 2010. “Critical Values for Cointegration Tests.” Queen”s University, Dept of Economics Working Papers 1227. http://ideas.repec.org/p/qed/wpaper/1227.html

Muñoz Sabater, J. (2019). ERA5-Land hourly data from 2001 to present [Data set]. ECMWF. https://doi.org/10.24381/CDS.E2161BAC

Muschinski , T. et al (2023). Robust Weather-Adaptive Post-Processing using Model Output Statistics Random Forests. Nonlinear Processes in Geophysics, 30, 503–514. https://doi.org/10.5194/npg-30-503-2023

NCEI. (n.d.). Storm Events Database. National Centers for Environmental Information. https://www.ncdc.noaa.gov/stormevents/listevents.jsp?eventType=ALL&beginDate_mm=09&beginDate_dd=01&beginDate_yyyy=2023&endDate_mm=12&endDate_dd=31&endDate_yyyy=2023&county=ALL&hailfilter=0.00&tornfilter=0&windfilter=000&sort=DT&submitbutton=Search&statefips=36%2CNEW%2BYORK

NEON (National Ecological Observatory Network). Shortwave and Longwave Radiation (Net Radiometer) (DP1.00023.001), Provisional Data. Dataset accessed from https://data.neonscience.org/data-products/DP1.00023.001 on November 24, 2024

NOAA. (2017, January 20). Hurricanes and Typhoons, 1851-2014. Kaggle. https://www.kaggle.com/datasets/noaa/hurricane-database/data

NOAA Predicts Above-Normal 2024 Atlantic Hurricane Season. National Oceanic and Atmospheric Administration. (n.d.). https://www.noaa.gov/news-release/noaa-predicts-above-normal-2024-atlantic-hurricane-season

NWS (National Weather Service). 2011. Meteorological Conversions and Calculations: Heat Index Calculator. Available: https://www.wpc.ncep.noaa.gov/html/heatindex.shtml [accessed 02 October 2024]

Open-Meteo. (2022). Historical Weather API. Historical Weather API. https://open-meteo.com/en/docs/historical-weather-api

Perktold, J., & Seabold, S. (n.d.). Stationarity and Detrending (ADF/KPSS) - statsmodels 0.14.1. https://www.statsmodels.org/stable/examples/notebooks/generated/stationarity_detrending_adf_kpss.html

Perktold, J., & Seabold, S. (n.d.). Statsmodels.tsa.stattools.coint - statsmodels 0.15.0 (+270). https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.coint.htmlerg

Saffir-Simpson Hurricane Wind Scale. (n.d.). https://www.nhc.noaa.gov/aboutsshws.php

Sarmento, D.(n.d.). Chapter 22: Correlation Types and When to Use Them. https://ademos.people.uic.edu/Chapter22.html

Schimanke S., Ridal M., Le Moigne P., Berggren L., Undén P., Randriamampianina R., Andrea U., Bazile E., Bertelsen A., Brousseau P., Dahlgren P., Edvinsson L., El Said A., Glinton M., Hopsch S., Isaksson L., Mladek R., Olsson E., Verrelle A., Wang Z.Q. (2021). CERRA sub-daily regional reanalysis data for Europe on single levels from 1984 to present [Data set]. ECMWF. https://doi.org/10.24381/CDS.622A565A

Stalpers, L. J. A., & Kaplan, E. L. (2018). Edward L. Kaplan and the Kaplan-Meier Survival Curve. BSHM Bulletin: Journal of the British Society for the History of Mathematics, 33(2), 109–135.https://doi.org/10.1080/17498430.2018.1450055

Stigler, M. (2020). Chapter 7 - Nonlinear Time sSeries in R: Threshold cointegration with tsDyn. In: Handbook of Statistics. Elsevier, Volume 42, 2020, Pages 229-264. https://doi.org/10.1016/bs.host.2019.01.008

The Comprehensive R Archive Network. (n.d.). https://cran.r-project.org/web/packages/prophet/vignettes/quick_start.html

Warner, T. T. (2010). Ensemble Methods. In: Numerical Weather and Climate Prediction (pp. 252–283). Chapter 7, Cambridge: Cambridge University Press.

World Health Organization. (n.d.). Types of Pollutants. World Health Organization. https://www.who.int/teams/environment-climate-change-and-health/air-quality-and-health/health-impacts/types-of-pollutants

WMO Meteorological Codes. WMO meteorological codes. (n.d.). https://artefacts.ceda.ac.uk/badc_datadocs/surface/code.html

Zippenfenig, P. (2023). Open-Meteo.com Weather API [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.7970649

5.2 Smoothing Time Series: Stat 510. PennState: Statistics Online Courses. (n.d.). https://online.stat.psu.edu/stat510/lesson/5/5.2

In [ ]: